###**Installation Of Packages and Libraries**

In [None]:
!pip install tensorflow

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
import re
nltk.download('punkt')
nltk.download("stopwords")
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

###**Loading of Dataset**

In [None]:
!wget -O review.json "https://figshare.com/ndownloader/files/41333352"           #review yelp dataset
!wget -O business.json "https://figshare.com/ndownloader/files/41402280"         #business yelp dataset

###**Reading Dataset**

In [5]:
data2_business = pd.read_json("/content/business.json",lines = True)

In [6]:
data4_review = []
r_dtypes = {"stars": np.int32,
            "useful": np.int32,
            "funny": np.int32,
            "cool": np.int32,
           }
with open("review.json", "r") as f:
    review = pd.read_json(f, lines=True,
                          dtype=r_dtypes, chunksize=1000)

    for chunk in review:
        reduced_chunk = chunk.drop(columns=['useful','cool','funny'])
        data4_review.append(reduced_chunk)

data4_review = pd.concat(data4_review, ignore_index=True)
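The chunked read above can be tried on a small in-memory sample before running it on the full file. The following is a minimal sketch, assuming three made-up review records and a chunk size of 2; only the `pd.read_json(..., lines=True, chunksize=...)` pattern is taken from the notebook.

```python
import io
import pandas as pd

# Made-up JSON Lines sample standing in for review.json
json_lines = "\n".join([
    '{"stars": 5, "useful": 1, "funny": 0, "cool": 0, "text": "great"}',
    '{"stars": 1, "useful": 0, "funny": 2, "cool": 0, "text": "awful"}',
    '{"stars": 3, "useful": 0, "funny": 0, "cool": 1, "text": "fine"}',
])

# With chunksize set, read_json returns an iterator of DataFrames
reader = pd.read_json(io.StringIO(json_lines), lines=True, chunksize=2)

# Drop the unused vote columns chunk by chunk to keep memory low
parts = [chunk.drop(columns=["useful", "funny", "cool"]) for chunk in reader]
reviews = pd.concat(parts, ignore_index=True)
print(reviews)
```

Dropping columns inside the loop means the wide chunks never all sit in memory at once, which is the reason for reading the review file this way.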

###**Calling Dataset**

In [None]:
data2_business

In [None]:
data4_review

###**Pre-Processing Steps**

####Drop unwanted columns

In [None]:
drop_column = ['longitude','latitude']
data2_business = data2_business.drop(drop_column,axis=1)
data2_business

In [None]:
data2_business = data2_business.drop(data2_business[data2_business['is_open']==0].index)
data2_business

In [None]:
drop_column = ['attributes','hours']
# combined_review_business is only created by the merge further down,
# so the columns are dropped from the business frame here
data2_business = data2_business.drop(drop_column,axis=1)
data2_business

####Shortlisting Top 15 Categories

In [11]:
data2_business_category = data2_business.assign(categories = data2_business.categories
                          .str.split(', ')).explode('categories')
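The split-and-explode step above is easier to follow on a toy frame. This is a small sketch with made-up business names and categories; it shows how a comma-separated string becomes one row per category.

```python
import pandas as pd

# Toy stand-in for the business table; names and categories are made up
toy = pd.DataFrame({
    "name": ["A", "B"],
    "categories": ["Restaurants, Pizza", "Restaurants, Bars"],
})

# Split each comma-separated string into a list, then emit one row per list item
exploded = toy.assign(categories=toy["categories"].str.split(", ")).explode("categories")

# Count how often each individual category occurs
category_counts = exploded["categories"].value_counts()
print(category_counts)
```

After exploding, `value_counts()` counts individual categories rather than whole category strings.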

In [None]:
print("Top 15 categories are:")
top15_business_category = data2_business_category['categories'].value_counts().head(15)
top15_business_category

In [13]:
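# Note: this overwrites the per-category counts shown above with the 15 most frequent
# full (unexploded) category strings, which is what the filter below matches against
top15_business_category = data2_business.categories.value_counts()[:15].index.tolist()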

####Filtering dataset based on top 15 categories

In [14]:
filtered_data2_business = data2_business[data2_business.categories.isin(top15_business_category)]

####Merge Business and Review Dataframe

In [15]:
combined_review_business = pd.merge(data4_review, filtered_data2_business, on='business_id',how='inner')

In [None]:
combined_review_business

In [None]:
combined_review_business = combined_review_business.rename(columns={'stars_x': 'review_stars'})
combined_review_business = combined_review_business.rename(columns={'stars_y': 'business_stars'})
combined_review_business['business_stars'] = combined_review_business['business_stars'].astype(int)
combined_review_business

#### Calculate Review Text Length

In [None]:
combined_review_business['text length']= combined_review_business['text'].apply(len)
combined_review_business.sample(2)
combined_review_business = combined_review_business[combined_review_business['text length'] <= 1000].copy()
combined_review_business

####Exploratory Data Analysis

In [None]:
plt.figure(figsize=(10,5))
combined_review_business['review_stars'].value_counts().plot(kind='bar',alpha=0.5,color='blue',label='ratings')
plt.legend()
plt.xlabel("Review Stars")
plt.ylabel("Reviews Count")
plt.title('BarChart for Star Reviews vs Reviews Count')
plt.yscale('log')
plt.show()

In [None]:
# DataFrame.plot opens its own figure, so the size is passed directly instead of via plt.figure
combined_review_business.groupby('state')['review_stars'].value_counts().unstack().plot(kind='bar', stacked=True, alpha=0.7, figsize=(10, 5))
plt.legend(title='Review Stars')
plt.xlabel("State")
plt.ylabel("Reviews Count")
plt.title('Bar Chart for Star Reviews vs Geographic Location')
plt.show()

In [None]:
plt.figure(figsize=(12,5))
combined_review_business[combined_review_business['review_stars']==1]['text length'].plot(bins=35,alpha=0.5,kind='hist',color='red',label='rating 1')
plt.legend()
plt.title('Histogram for 1 Star Reviews vs Text Length')
plt.xlabel('Text length')
plt.show()

In [None]:
plt.figure(figsize=(12,5))
combined_review_business[combined_review_business['review_stars']==2]['text length'].plot(bins=35,alpha=0.5,kind='hist',color='red',label='rating 2')
plt.legend()
plt.title('Histogram for 2 Star Reviews vs Text Length')
plt.xlabel('Text length')
plt.show()

In [None]:
plt.figure(figsize=(12,5))
combined_review_business[combined_review_business['review_stars']==3]['text length'].plot(bins=35,alpha=0.5,kind='hist',color='red',label='rating 3')
plt.legend()
plt.title('Histogram for 3 Star Reviews vs Text Length')
plt.xlabel('Text length')
plt.show()

In [None]:
plt.figure(figsize=(12,5))
combined_review_business[combined_review_business['review_stars']==4]['text length'].plot(bins=35,alpha=0.5,kind='hist',color='red',label='rating 4')
plt.legend()
plt.title('Histogram for 4 Star Reviews vs Text Length')
plt.xlabel('Text length')
plt.show()

In [None]:
plt.figure(figsize=(12,5))
combined_review_business[combined_review_business['review_stars']==5]['text length'].plot(bins=35,alpha=0.5,kind='hist',color='red',label='rating 5')
plt.legend()
plt.title('Histogram for 5 Star Reviews vs Text Length')
plt.xlabel('Text length')
plt.show()

####LowerCase Reviews Text

In [None]:
combined_review_business['text'] = combined_review_business['text'].str.lower()
combined_review_business

####Functions to remove punctuation and special characters, tokenize, and remove stopwords

In [33]:
def spaces_remove(text):
    '''Removes awkward spaces'''
    text = text.strip()
    text = text.split()
    return " ".join(text)

def puncClean(text):
    '''Removes quotes, pipes and ?!# and replaces . , ( ) / and newlines with spaces'''
    clean_punc = re.sub(r'[?!\'"#|]', '', text)
    clean_punc = re.sub(r'[.,()/]', ' ', clean_punc)
    clean_punc = clean_punc.strip()
    clean_punc = clean_punc.replace("\n", " ")
    return clean_punc

def keepAlphabets(text):
    alphabet_sent = ""
    for word in text.split():
        alphabet_word = re.sub('[^a-z A-Z]+', ' ', word)
        alphabet_sent += alphabet_word
        alphabet_sent += " "
    alphabet_sent = alphabet_sent.strip()
    return alphabet_sent

# Build the English stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def stopwordRemove(text):
    '''Tokenizes the text and drops English stop words'''
    words = word_tokenize(text)
    # Eliminating stop words
    filtered_words = [word for word in words if word.lower() not in STOP_WORDS]
    # Combine the filtered words to a sentence
    return ' '.join(filtered_words)

def text_preprocess(text):
    '''Cleaning and parsing the text.'''
    text = spaces_remove(text)
    text = puncClean(text)
    text = keepAlphabets(text)
    text = stopwordRemove(text)
    return text

In [34]:
combined_review_business['text'] = combined_review_business['text'].apply(text_preprocess)
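To see roughly what the cleaning does to a single review, here is a simplified, standard-library-only sketch. It is not the notebook's exact pipeline: `clean_review` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is a tiny made-up subset rather than NLTK's English list, and every non-letter is turned into a space.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's English list
TOY_STOP_WORDS = {"the", "was", "and", "but", "a", "is"}

def clean_review(text: str) -> str:
    """Lowercase, keep letters only, collapse whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z]+", " ", text)   # anything that is not a letter becomes a space
    tokens = text.split()                   # split() also collapses repeated whitespace
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

sample = "The food was GREAT!!!  But the service... was slow (45 min)."
cleaned = clean_review(sample)
print(cleaned)  # food great service slow min
```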

In [None]:
combined_review_business.head()

####Drop Null Rows

In [36]:
df_cleaned = combined_review_business.dropna(how='any')

In [None]:
print("Missing values in df_cleaned:")
print(df_cleaned.isnull().sum())


####Generate Plutchik's Wheel of Emotions Column in Dataframe

In [38]:
emotion_list = ['joy','trust','fear','surprise','sadness','anticipation','anger','disgust','neutral']


In [39]:
def idx2class(idx_list):
    arr = []
    for i in range(0,idx_list):
        arr.append(emotion_list[int(i)])
    return arr

In [40]:
combined_review_business['Emotions'] = combined_review_business['review_stars'].apply(idx2class)

In [None]:
combined_review_business['Emotions'].head()

####Assign Sentiments to the Dataframe based on Review Stars

In [44]:
strong_positive=[]
positive = []
strong_negative = []
neutral = []
negative=[]

strong_positive_reviews=combined_review_business[combined_review_business['review_stars']==5]['text']
positive_reviews=combined_review_business[combined_review_business['review_stars']==4]['text']
neutral_reviews=combined_review_business[combined_review_business['review_stars']==3]['text']
negative_reviews=combined_review_business[combined_review_business['review_stars']==2]['text']
strong_negative_reviews=combined_review_business[combined_review_business['review_stars']==1]['text']
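The same star-to-sentiment grouping can also be written as a lookup table. This is a sketch only: the label strings and the helper `sentiment_for` are illustrative names, not part of the notebook.

```python
# Label names mirror the five groups defined above; the strings themselves are illustrative
STAR_TO_SENTIMENT = {
    5: "strong_positive",
    4: "positive",
    3: "neutral",
    2: "negative",
    1: "strong_negative",
}

def sentiment_for(stars: int) -> str:
    """Return the sentiment label for a star rating between 1 and 5."""
    if stars not in STAR_TO_SENTIMENT:
        raise ValueError(f"unexpected star rating: {stars}")
    return STAR_TO_SENTIMENT[stars]

print([sentiment_for(s) for s in (1, 3, 5)])
```

On a DataFrame, the same dictionary could be applied with `combined_review_business['review_stars'].map(STAR_TO_SENTIMENT)` to produce a sentiment column in one step.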

####WordCloud based on Sentiments

Strong Positive Review(5 Star Reviews)

In [None]:
from wordcloud import WordCloud
pos_review_cloud=WordCloud(width=600,height=400).generate(" ".join(strong_positive_reviews))
plt.figure(figsize=(10,8),facecolor='k')
plt.imshow(pos_review_cloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Positive Sentiment Review(4 Star Reviews)

In [None]:
from wordcloud import WordCloud
pos_review_cloud=WordCloud(width=600,height=400).generate(" ".join(positive_reviews))
plt.figure(figsize=(10,8),facecolor='k')
plt.imshow(pos_review_cloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Neutral Sentiment(3 Star Reviews)

In [None]:
from wordcloud import WordCloud
pos_review_cloud=WordCloud(width=600,height=400).generate(" ".join(neutral_reviews))
plt.figure(figsize=(10,8),facecolor='k')
plt.imshow(pos_review_cloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Negative Sentiment(2 Star Reviews)

In [None]:
from wordcloud import WordCloud
pos_review_cloud=WordCloud(width=600,height=400).generate(" ".join(negative_reviews))
plt.figure(figsize=(10,8),facecolor='k')
plt.imshow(pos_review_cloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Strong Negative Sentiment(1 Star Reviews)

In [None]:
from wordcloud import WordCloud
pos_review_cloud=WordCloud(width=600,height=400).generate(" ".join(strong_negative_reviews))
plt.figure(figsize=(10,8),facecolor='k')
plt.imshow(pos_review_cloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Save Dataset

In [45]:
combined_review_business.to_parquet('/content/drive/MyDrive/Colab Notebooks/preprocessed_data.parquet')

In [4]:
combined_review_business=pd.read_parquet('/content/drive/MyDrive/Colab Notebooks/preprocessed_data.parquet')

####Train and Test Split

In [5]:
y=combined_review_business['review_stars']
x=combined_review_business['text']

In [6]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)

In [None]:
print("x_train shape :",x_train.shape)
print("x_test shape  :",x_test.shape)
print("y_train shape :",y_train.shape)
print("y_test shape  :",y_test.shape)

###**Implementation of Models**

###Machine Learning Techniques

####Logistic Regression

In [None]:
cv=CountVectorizer()
lr=LogisticRegression(random_state = 123)
x_train1=cv.fit_transform(x_train)
x_test1 = cv.transform(x_test)
lr.fit(x_train1,y_train)
pred_1=lr.predict(x_test1)
score_1=accuracy_score(y_test,pred_1)

####Cross Validation of Logistic Regression

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
scores = cross_val_score(lr, x_train1, y_train, cv=stratified_kfold, scoring="accuracy")

In [None]:
print("Cross-Validation Scores:", scores)
print("Average CV Score:", scores.mean())
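Cross-validating on an already vectorized matrix means the vocabulary was learned from every fold. One possible alternative, sketched here on a made-up toy corpus, is to wrap the vectorizer and classifier in a scikit-learn pipeline so that each fold fits its own vocabulary. The texts, labels and the choice of 3 folds are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: six 5-star and six 1-star reviews
texts = [
    "great food", "loved the service", "amazing place",
    "great staff and great food", "really loved it", "amazing and friendly",
    "terrible food", "awful service", "rude staff",
    "terrible and slow", "awful place never again", "rude and slow service",
]
labels = [5] * 6 + [1] * 6

# The vectorizer is refit inside every training fold, so no held-out text leaks into the vocabulary
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000, random_state=123))
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)
cv_scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")

print("Cross-Validation Scores:", cv_scores)
print("Average CV Score:", cv_scores.mean())
```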

In [None]:
classification_rep = classification_report(y_test, pred_1)

print("Accuracy:", score_1)
print("Classification Report:\n", classification_rep)

####Naive Bayes Classifier

In [83]:
cv=CountVectorizer()
nb=ComplementNB()
x_train1=cv.fit_transform(x_train)
nb.fit(x_train1,y_train)
pred_2=nb.predict(cv.transform(x_test))
score_2=accuracy_score(y_test,pred_2)

In [None]:
classification_rep2 = classification_report(y_test, pred_2)

print("Accuracy:", score_2)
print("Classification Report:\n", classification_rep2)

####Cross Validation of Naive Bayes Classifier

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(nb, x_train1, y_train, cv=stratified_kfold, scoring="accuracy")

print("Cross-Validation Scores:", scores)
print("Average CV Score:", scores.mean())

####Support Vector Machine(SVM)

In [None]:
cv=CountVectorizer()
svm=SVC()
x_train2=cv.fit_transform(x_train)
svm.fit(x_train2,y_train)
pred_3=svm.predict(cv.transform(x_test))
score_3=accuracy_score(y_test,pred_3)

####Cross Validation of SVM

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
stratified_kfold1 = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
scores = cross_val_score(svm, x_train2, y_train, cv=stratified_kfold1, scoring="accuracy")

print("Cross-Validation Scores:", scores)
print("Average CV Score:", scores.mean())

In [None]:
classification_rep3 = classification_report(y_test, pred_3)

print("Accuracy:", score_3)
print("Classification Report:\n", classification_rep3)

####Decision Tree

In [8]:
cv = CountVectorizer()
dt=DecisionTreeClassifier(random_state=42)
x_train3=cv.fit_transform(x_train)
dt.fit(x_train3,y_train)
pred_4=dt.predict(cv.transform(x_test))
score_4=accuracy_score(y_test,pred_4)

####Cross Validation of Decision Tree

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(dt, x_train3, y_train, cv=stratified_kfold, scoring="accuracy")

print("Cross-Validation Scores:", scores)
print("Average CV Score:", scores.mean())

In [None]:
classification_rep4 = classification_report(y_test, pred_4)
print("Accuracy:", score_4)
print("Classification Report:\n", classification_rep4)

#### K-Nearest Neighbor

In [11]:
from sklearn.neighbors import KNeighborsClassifier

cv = CountVectorizer()
x_train4 = cv.fit_transform(x_train)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train4, y_train)
pred_5=knn.predict(cv.transform(x_test))
score_5 = accuracy_score(y_test, pred_5)

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(knn, x_train4, y_train, cv=stratified_kfold, scoring="accuracy")

print("Cross-Validation Scores:", scores)
print("Average CV Score:", scores.mean())

In [None]:
classification_rep5 = classification_report(y_test, pred_5)
print("Accuracy:", score_5)
print("Classification Report:\n", classification_rep5)

####Ensemble Learning Technique - Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

cv = CountVectorizer()

x_train5 = cv.fit_transform(x_train)

rfc = RandomForestClassifier(random_state=42)
rfc.fit(x_train5, y_train)
pred_6=rfc.predict(cv.transform(x_test))
accuracy = accuracy_score(y_test, pred_6)
report = classification_report(y_test, pred_6)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)


####Cross Validation of Random Forest

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(rfc, x_train5, y_train, cv=stratified_kfold, scoring="accuracy")

print("Cross-Validation Scores:", scores)
print("Average CV Score:", scores.mean())

####Lexicon Based Technique - NRCLex

In [None]:
!pip install NRCLex

In [None]:
from nrclex import NRCLex
combined_review_business['emotions'] = combined_review_business['text'].apply(lambda x: NRCLex(x).affect_frequencies)

In [None]:
combined_review_business = pd.concat([combined_review_business.drop(['emotions'], axis = 1), combined_review_business['emotions'].apply(pd.Series)], axis = 1)
combined_review_business.head()

####Deep Learning Technique - LSTM

In [8]:
vocab_size = 10000
embedding_dim = 64
max_length = 1000
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'

In [9]:
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(x_train)

In [10]:
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, padding=padding_type, truncating=trunc_type, maxlen=max_length)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, padding=padding_type, truncating=trunc_type, maxlen=max_length)
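The effect of `padding='post'` and `truncating='post'` can be illustrated without TensorFlow. The helper below, `pad_post`, is a hypothetical pure-Python stand-in written for this sketch and is not the Keras implementation; the toy sequences and `maxlen=4` are made up.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Keep the first maxlen ids of each sequence and fill the remainder at the end."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:maxlen])                                   # 'post' truncation drops the tail
        padded.append(clipped + [pad_value] * (maxlen - len(clipped)))  # 'post' padding appends zeros
    return padded

toy_sequences = [[5, 2, 9], [7], [3, 1, 4, 1, 5, 9]]
padded = pad_post(toy_sequences, maxlen=4)
print(padded)  # [[5, 2, 9, 0], [7, 0, 0, 0], [3, 1, 4, 1]]
```

Every row ends up with the same length, which is what the embedding layer needs as input.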


In [None]:
model = tf.keras.Sequential([
tf.keras.layers.Embedding(vocab_size, embedding_dim),
tf.keras.layers.LSTM(embedding_dim,  return_sequences=True),
tf.keras.layers.LSTM(64),
tf.keras.layers.Dense(embedding_dim, activation='relu'),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(5, activation='softmax')  # one output per star rating (1-5)
])

model.summary()

In [None]:
# Star ratings run from 1 to 5; the sparse categorical loss expects class ids 0 to 4
y_train_lstm = y_train.to_numpy() - 1
y_test_lstm = y_test.to_numpy() - 1

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
# The softmax layer already outputs probabilities, so from_logits is left at its default (False)
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(0.001),
              metrics=['accuracy'])

history = model.fit(x_train, y_train_lstm, epochs=10,validation_split=0.1, batch_size=64, shuffle=True, callbacks=[early_stop])

In [None]:
history_dict = history.history

# Named train_accuracy so the KNN accuracy in score_5 is not overwritten
train_accuracy = history_dict['accuracy']
val_accuracy = history_dict['val_accuracy']
loss1 = history_dict['loss']
val_loss1 = history_dict['val_loss']
epoch1 = history.epoch

plt.figure(figsize=(10,6))
plt.plot(epoch1, loss1, 'r', label='Training loss')
plt.plot(epoch1, val_loss1, 'b', label='Validation loss')
plt.title('Training and validation loss', size=15)
plt.xlabel('Epochs', size=15)
plt.ylabel('Loss', size=15)
plt.legend(prop={'size': 15})
plt.show()

plt.figure(figsize=(10,6))
plt.plot(epoch1, train_accuracy, 'g', label='Training acc')
plt.plot(epoch1, val_accuracy, 'b', label='Validation acc')
plt.title('Training and validation accuracy', size=15)
plt.xlabel('Epochs', size=15)
plt.ylabel('Accuracy', size=15)
plt.legend(prop={'size': 15})
plt.ylim((0,1))
plt.show()

In [None]:
model.evaluate(x_test, y_test_lstm)

In [None]:
# Convert class probabilities back to star ratings (class id + 1)
pred_lstm = model.predict(x_test).argmax(axis=1) + 1

####Pretrained Model - BERT

In [None]:
! pip install transformers

In [6]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding
!pip install transformers >> NULL

In [7]:
from sklearn.model_selection import train_test_split

# 60% train, then split the held-out 40% evenly into test and validation sets
train, test = train_test_split(combined_review_business,random_state=104, test_size=0.40, stratify = combined_review_business["review_stars"], shuffle=True)
test, val = train_test_split(test,random_state=104, test_size=0.50, stratify = test["review_stars"], shuffle=True)
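For a three-way split, the parts should not share any rows. One way to get that, sketched here on toy row ids rather than the review data, is to split once and then split only the held-out portion a second time. The 100 made-up rows and the variable names are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 row ids with 20 rows for each star rating 1..5
rows = list(range(100))
stars = [1 + (i % 5) for i in rows]

# First split: 60% train, 40% held out
train_idx, rest_idx, train_y, rest_y = train_test_split(
    rows, stars, test_size=0.40, stratify=stars, random_state=104)

# Second split: only the held-out rows are divided into test and validation
test_idx, val_idx, test_y, val_y = train_test_split(
    rest_idx, rest_y, test_size=0.50, stratify=rest_y, random_state=104)

print(len(train_idx), len(val_idx), len(test_idx))
# The three sets should be pairwise disjoint
print(set(train_idx) & set(test_idx), set(train_idx) & set(val_idx), set(test_idx) & set(val_idx))
```

Splitting the full data a second time instead would place training rows in the evaluation sets and inflate the reported scores.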

In [None]:
!pip install transformers[torch]

In [None]:
!pip install accelerate==0.20.1

In [13]:
import accelerate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

In [None]:
from sklearn.preprocessing import LabelEncoder
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import Trainer
from transformers import TrainingArguments
import numpy as np

# 1. Create the label encoder
le = LabelEncoder()

# 2. Fit the label encoder to the label in our dataset
le.fit(train["review_stars"])

# 3. Create a new column with encoded labels
train["encoded_label"] = le.transform(train["review_stars"])
val["encoded_label"] = le.transform(val["review_stars"])
test["encoded_label"] = le.transform(test["review_stars"])

train_labels = torch.tensor(train["encoded_label"].tolist())
val_labels = torch.tensor(val["encoded_label"].tolist())
test_labels = torch.tensor(test["encoded_label"].tolist())

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

train_encodings = tokenizer(
  train["text"].tolist(),
  padding=True,           # pad all inputs to max length
  max_length=24,         # Bert max is 512, we choose 24 for computational efficiency
  return_tensors="pt",    # Return format pytorch tensor
  truncation=True
)

train_encodings.keys()
train_encodings

val_encodings = tokenizer(
    val["text"].tolist(),
    padding=True,           # pad all inputs to max length
    max_length=24,         # Bert max is 512, we choose 24 for computational efficiency
    return_tensors="pt",    # Return format pytorch tensor
    truncation=True
)

test_encodings = tokenizer(
    test["text"].tolist(),
    padding=True,           # pad all inputs to max length
    max_length=24,         # Bert max is 512, we choose 24 for computational efficiency
    return_tensors="pt",    # Return format pytorch tensor
    truncation=True
)

class RelationDataset(Dataset):

    def __init__(self, encodings: dict):
        self.encodings = encodings

    def __len__(self) -> int:
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx: int) -> dict:
        e = {k: v[idx] for k,v in self.encodings.items()}
        return e


# Update encodings with labels
train_encodings["labels"] = train_labels
val_encodings["labels"] = val_labels
test_encodings["labels"] = test_labels

# Generate Datasets
train_ds = RelationDataset(train_encodings)
val_ds = RelationDataset(val_encodings)
test_ds = RelationDataset(test_encodings)

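# One output per encoded class (the five star ratings) rather than a hard-coded count
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(le.classes_))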

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    lr_scheduler_type='cosine',
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    fp16=False,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)

trainer.train()

preds = trainer.predict(test_ds)
print(preds)


preds = le.inverse_transform(np.argmax(preds.predictions, axis=1))
print(classification_report(test["review_stars"].tolist(), preds))

####References

1) Tensorflow Install: https://www.tensorflow.org/install/pip

2) Count Vectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

3) Logistic Regression Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

4) Naive Bayes Classifier: https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes

5) Support Vector Machine Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

6) Load Dataframe to Parquet File: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html

7) Read Dataframe: https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

8) Decision Tree Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

9) KNN Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

10) Error faced during emotion list creation: https://stackoverflow.com/questions/1841565/valueerror-invalid-literal-for-int-with-base-10

11) Fit Model Error: https://datascience.stackexchange.com/questions/20199/train-test-split-error-found-input-variables-with-inconsistent-numbers-of-sam

12) Advanced Natural Language Processing Lab 4 Material.

13) Advanced Natural Language Processing Notes-Opinion Mining I
Emotion Analysis, Taught by Dr. Paul Buitelaar & Dr. Omnia Zayed

14) Advanced Natural Language Processing Assignment 1,Taught by Dr. Paul Buitelaar & Dr. Omnia Zayed.