<a href="https://colab.research.google.com/github/ChaturyaGajula/DATA603-Spring22-2274-Th/blob/main/MidTerm_V1_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this project, we are solving a text classification problem: each entry must be assigned to one of the 12 categories provided. Unlike problems with purely numeric features, the text must first be converted into embeddings, because a model can only operate on numbers.

In [None]:
import pandas as pd
df = pd.read_csv('https://github.com/msaricaumbc/DS_data/blob/master/ds602/dataset_newsletter.csv?raw=true')

In [None]:
df.drop(columns='Unnamed: 0', inplace = True)

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
# 'category' has no null values, so group by it and count the non-null entries in the other columns
df.groupby('category').count()

In [None]:
title_null_counts = df.groupby('category')['title'].apply(lambda x: x.isnull().sum())
print(title_null_counts)

In [None]:
body_null_counts = df.groupby('category')['body'].apply(lambda x: x.isnull().sum())
print(body_null_counts)

In [None]:
signature_null_counts = df.groupby('category')['signature'].apply(lambda x: x.isnull().sum())
print(signature_null_counts)

In [None]:
# The number of null values is negligible, so we drop every row that has at least one null value.
# In addition, 'body' is likely to have more impact than 'signature' or 'title', and it contains few nulls.
temp_df = df.dropna()

In [None]:
temp_df.isna().sum()

In [None]:
temp_df.describe()

In [None]:
# A single dot ('.') appears 129 times in the 'body' column and contributes little to the model, so drop those rows.
value_to_remove = '.'
mask = temp_df['body'] != value_to_remove
filtered_df = temp_df[mask]
filtered_df

In [None]:
filtered_df.describe()

In [None]:
filtered_df.category.value_counts()

In [None]:
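# week_processed is only defined further down, so compute the flag here (0 = weekend, 1 = weekday).
is_weekend = pd.to_datetime(filtered_df.submissiontime).dt.weekday.map(lambda x: 0 if x >= 5 else 1)
temp_df = filtered_df.assign(isweekend=is_weekend)
temp_df.groupby('category')['isweekend'].value_counts()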

In [None]:
# We now have the filtered data. Since 3 of the 4 columns are text, we convert them into embeddings.
# Install the library used to compute the embeddings.
!pip install -U sentence-transformers

In [None]:
# Load a pretrained sentence-embedding model from the Hugging Face hub
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
# drop=True discards the old index instead of keeping it as an extra 'index' column
filtered_df = filtered_df.reset_index(drop=True)

In [None]:
# The BERT-style model handles tokenization internally, so we skip manual preprocessing such as removing stop words, lowercasing, or removing punctuation.
body_embeddings = model.encode(filtered_df.body.astype(str))

In [None]:
body_embeddings.shape

In [None]:
title_embeddings = model.encode(filtered_df.title)

In [None]:
title_embeddings.shape

In [None]:
signature_embeddings = model.encode(filtered_df.signature)

In [None]:
signature_embeddings.shape

In [None]:
week_unprocessed = pd.to_datetime(filtered_df.submissiontime)

In [None]:
week_unprocessed

In [None]:
week_processed = week_unprocessed.dt.weekday.map(lambda x: 0 if x >= 5 else 1)

In [None]:
week_processed
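
The weekday encoding above relies on the convention that Monday is 0 and Sunday is 6, so values 5 and 6 mark the weekend; note that the resulting flag is 0 for weekend days and 1 for weekdays. A minimal sketch of the same rule using the standard library's `datetime`, which follows the same numbering as pandas' `dt.weekday`. The helper name `weekday_flag` and the dates are made up for illustration and are not rows from the dataset:

```python
from datetime import date

def weekday_flag(d: date) -> int:
    """Return 0 for Saturday/Sunday and 1 for Monday-Friday (same rule as the lambda above)."""
    return 0 if d.weekday() >= 5 else 1

# Hypothetical sample dates: 2022-03-11 is a Friday, 2022-03-12 a Saturday, 2022-03-13 a Sunday.
sample_dates = [date(2022, 3, 11), date(2022, 3, 12), date(2022, 3, 13)]
flags = [weekday_flag(d) for d in sample_dates]
print(flags)
```

Because 0 means weekend here, a column name such as `isweekend` reads inverted; a name like `isweekday` would likely match the values better.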

**Written analysis of the cleaned data**
- We cleaned the data by dropping rows with null values and rows whose body is only a single dot.
- We reduced the submission time column to a binary weekend/weekday flag (0 = weekend, 1 = weekday) to reduce computational overhead.
- We used a BERT-style sentence-transformer model (all-MiniLM-L6-v2) to convert each text field into an embedding; because it is attention-based, it can give better results than TF-IDF or Count Vectorizer.

**Note**
Since the BERT-style model handles text preprocessing internally, we do not perform any preprocessing steps explicitly.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=20)
title_transformed = pca.fit_transform(title_embeddings)
print(title_transformed.shape)
body_transformed = pca.fit_transform(body_embeddings)
print(body_transformed.shape)
signature_transformed = pca.fit_transform(signature_embeddings)
print(signature_transformed.shape)
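
Reducing each embedding to 20 components discards part of its variance, and it may be worth checking how much is retained. A rough sketch, assuming a random matrix as a stand-in for the real embeddings; the names `fake_embeddings` and `pca_check` are hypothetical, and the retained fraction on the actual data will differ:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for an embedding matrix: 200 rows, 384 columns
# (384 is the output size of all-MiniLM-L6-v2; the values are random, not real embeddings).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 384))

pca_check = PCA(n_components=20)
reduced = pca_check.fit_transform(fake_embeddings)

# Fraction of the total variance captured by the 20 retained components.
retained = float(pca_check.explained_variance_ratio_.sum())
print(reduced.shape, round(retained, 3))
```

Running the same check on `title_embeddings`, `body_embeddings`, and `signature_embeddings` would show whether 20 components is a reasonable choice for each field.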


In [None]:
# Prefix the column names so the three blocks stay distinguishable after concatenation.
title_df = pd.DataFrame(title_transformed).add_prefix('title_')

In [None]:
body_df = pd.DataFrame(body_transformed).add_prefix('body_')

In [None]:
signature_df = pd.DataFrame(signature_transformed).add_prefix('signature_')

In [None]:
X = pd.concat([title_df, body_df, signature_df, week_processed], axis=1)
X.columns=X.columns.astype(str) 

In [None]:
y = filtered_df.category

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 101)
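
Since the categories are imbalanced, a plain random split can leave rare classes under-represented in the test set. One option is the `stratify` argument of `train_test_split`, which keeps the class proportions in both parts. A small sketch on made-up labels (`X_demo` and `y_demo` are hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101, stratify=y_demo
)

# With stratification the 20-row test set keeps the 90/10 ratio: 18 vs 2.
test_counts = np.bincount(y_te)
print(test_counts)
```

Applied to this notebook, that would amount to passing `stratify=y` in the split above; whether it changes the scores noticeably is not something this sketch establishes.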

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('pcs', PCA()),
    # max_iter=1000 is an assumed value, raised from the default to help convergence
    ('clf', LogisticRegression(max_iter=1000))
])

In [None]:
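from sklearn.model_selection import GridSearchCV

# 'elasticnet' is only supported by the 'saga' solver and needs an l1_ratio,
# so it gets its own sub-grid (l1_ratio=0.5 is an assumed starting value).
param_grid = [
    {'pcs__n_components': [5, 15, None], 'clf__penalty': [None]},
    {'pcs__n_components': [5, 15, None], 'clf__penalty': ['elasticnet'],
     'clf__solver': ['saga'], 'clf__l1_ratio': [0.5]},
]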

In [None]:
grid_search = GridSearchCV(pipeline, param_grid = param_grid, cv=5)
grid_search.fit(X_train,y_train)

In [None]:
grid_search.best_score_

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('pcs', PCA()),
    ('clf', RandomForestClassifier())
])

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'pcs__n_components': [5, 15, None],
    'clf__n_estimators': [25, 100],  # None is not a valid n_estimators value, so it is dropped
    'clf__max_depth': [5, 10, None],
}

In [None]:
grid_search = GridSearchCV(pipeline, param_grid = param_grid, cv=5)
grid_search.fit(X_train,y_train)

In [None]:
grid_search.best_score_

In [None]:
grid_search.best_params_

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('pcs', PCA()),
    ('clf', SVC())
])

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'pcs__n_components' :[10, 20,None],
    'clf__gamma' : [5,10,'scale'],
}

In [None]:
grid_search = GridSearchCV(pipeline, param_grid = param_grid, cv=3)
grid_search.fit(X_train,y_train)

In [None]:
grid_search.best_score_

In [None]:
grid_search.best_params_

So far, the Support Vector Classifier is the best model according to the cross-validation score. However, since the Random Forest's accuracy is almost the same, we also evaluate it on the F1 score.

In [None]:
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)

In [None]:
clf = SVC(gamma='scale') 
clf.fit(X_train, y_train)

In [None]:
y_pred_clf = clf.predict(X_test)
y_pred_forest = forest.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
cm_clf = confusion_matrix(y_test,y_pred_clf)
cm_forest = confusion_matrix(y_test,y_pred_forest)

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt
sns.heatmap(cm_clf, annot=True, fmt='d', cmap='Blues')  # fmt='d' prints counts as plain integers
plt.xlabel('Predicted labels')
plt.ylabel('True Labels')
plt.show()

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt
sns.heatmap(cm_forest, annot=True, fmt='d', cmap='Blues')  # fmt='d' prints counts as plain integers
plt.xlabel('Predicted labels')
plt.ylabel('True Labels')
plt.show()

There is a considerable number of misclassifications among labels 1, 3, and 10; the model does not distinguish well between these classes.
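
With imbalanced classes, raw counts in a confusion matrix can hide how badly a small class is doing. Dividing each row by its total turns the counts into per-class rates, which makes confusions between specific labels easier to compare. A minimal sketch on a made-up 3-class matrix (`cm_demo` is hypothetical, not the notebook's output):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true labels, columns = predictions).
cm_demo = np.array([
    [8, 2, 0],
    [3, 6, 1],
    [0, 1, 9],
])

# Divide each row by its total so entry (i, j) is the share of class i predicted as j.
row_totals = cm_demo.sum(axis=1, keepdims=True)
cm_normalized = cm_demo / row_totals

# The diagonal then holds the per-class recall.
per_class_recall = np.diag(cm_normalized)
print(per_class_recall)
```

The same normalization could be applied to `cm_clf` and `cm_forest`, or requested directly through `confusion_matrix(..., normalize='true')`.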

In [None]:
from sklearn.metrics import f1_score
f1 = f1_score(y_test,y_pred_clf,average='weighted')
print('F1 Score : ', f1*100)

In [None]:
from sklearn.metrics import f1_score
f1 = f1_score(y_test,y_pred_forest,average='weighted')
print('F1 Score : ', f1*100)

We choose the F1 score as our metric because accuracy is unreliable when the classes are highly imbalanced, and it is not clear whether reducing false negatives or false positives matters more. We therefore use the harmonic mean of precision and recall, i.e., the F1 score. So far, with a weighted F1 score of 77.8, Random Forest is the best model.
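
To make the metric concrete, the sketch below computes F1 from raw counts and contrasts the weighted average (used above via `average='weighted'`) with the macro average. All counts are invented for illustration, and `f1_from_counts` is a hypothetical helper, not part of scikit-learn:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class counts: (true positives, false positives, false negatives).
counts = {"large_class": (90, 5, 10), "small_class": (5, 10, 5)}
support = {"large_class": 100, "small_class": 10}  # number of true samples per class

f1_per_class = {name: f1_from_counts(*c) for name, c in counts.items()}
total = sum(support.values())

# Weighted F1 weights each class by its support; macro F1 treats all classes equally.
weighted_f1 = sum(f1_per_class[n] * support[n] / total for n in counts)
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)
print(f1_per_class, weighted_f1, macro_f1)
```

In this toy case the weighted average sits well above the macro average because the large class dominates it. With highly imbalanced categories, reporting both may give a fuller picture of how the smaller classes are handled.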