
### Question 1.

The purpose of this question is to practice different machine learning algorithms for text classification, along with performance evaluation. In addition, you are required to conduct *10-fold cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html)* during training.

The dataset (see Canvas) contains two files, train data and test data, for sentiment analysis of IMDB reviews. It has two categories: 1 represents positive and 0 represents negative.

You need to split the training data into training and validation data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model needs to be evaluated on the test data.
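
As a rough illustration of how the 80/20 split and the 10 folds relate, the sketch below uses only the Python standard library. The helper names `train_val_split` and `k_fold_indices` are hypothetical and the data is a toy list; in the notebook itself, scikit-learn's `train_test_split` and `cross_val_score` do this work.

```python
import random


def train_val_split(items, val_fraction=0.2, seed=42):
    """Shuffle a copy of items and split off val_fraction of it as validation data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, held_out_idx) pairs; every sample is held out exactly once."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample when k does not divide n_samples
        size = base + (1 if fold < extra else 0)
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


# Toy stand-in for the training file (assumption: 100 reviews)
reviews = [f"review {i}" for i in range(100)]
train, val = train_val_split(reviews)          # 80% train, 20% validation
folds = list(k_fold_indices(len(train), k=10))  # 10 folds over the 80% portion only
print(len(train), len(val), len(folds), len(folds[0][1]))
```

The folds are built from the training portion only, so the validation and test data are never seen during cross-validation.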

Algorithms:
1. MultinomialNB (2 points)
2. SVM (2 points)
3. KNN (2 points)
4. Decision tree (2 points)
5. Random Forest (2 points)
6. XGBoost (2 points)

Evaluation measurement:
1. Accuracy (2 points)
2. Recall (2 points)
3. Precision (2 points)
4. F-1 score (2 points)
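
The four measurements above all derive from the same confusion counts. The sketch below is a minimal standard-library version for the binary case; the function name `binary_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here. The notebook itself relies on `sklearn.metrics` for these values.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: a metric with a zero denominator is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.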

In [None]:
!pip install scikit-learn numpy xgboost



In [None]:
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Help from Chatgpt

# Load your data
with open('/content/stsa-train.txt', 'r') as file:
    train_data = file.readlines()

with open('/content/stsa-test.txt', 'r') as file:
    test_data = file.readlines()

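# Extract features (X) and labels (y) from the data.
# The first token of each line is the label and the rest is the review text,
# so the label is kept out of the features.
X_train = [' '.join(line.split()[1:]) for line in train_data]
y_train = [int(line.split()[0]) for line in train_data]

X_test = [' '.join(line.split()[1:]) for line in test_data]
y_test = [int(line.split()[0]) for line in test_data]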

# Split the training data into training and validation sets (80% for training, 20% for validation)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
X_test_vec = vectorizer.transform(X_test)

# Define classifiers
classifiers = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'XGBoost': XGBClassifier()
}

# Perform 10-fold cross-validation and evaluate each classifier
for name, clf in classifiers.items():
    print(f"\nTraining and evaluating {name}...")

    # Cross-validation
    cv_scores = cross_val_score(clf, X_train_vec, y_train, cv=10, scoring='accuracy')
    print(f"{name} Cross-Validation Accuracy: {np.mean(cv_scores)}")

    # Train the final model on the entire training set
    clf.fit(X_train_vec, y_train)

    # Predict on the validation set
    val_preds = clf.predict(X_val_vec)

    # Evaluate on the validation set
    accuracy = accuracy_score(y_val, val_preds)
    precision = precision_score(y_val, val_preds)
    recall = recall_score(y_val, val_preds)
    f1 = f1_score(y_val, val_preds)

    print(f"{name} Validation Accuracy: {accuracy}")
    print(f"{name} Validation Precision: {precision}")
    print(f"{name} Validation Recall: {recall}")
    print(f"{name} Validation F1 Score: {f1}")

    # Evaluate on the test set
    test_preds = clf.predict(X_test_vec)

    test_accuracy = accuracy_score(y_test, test_preds)
    test_precision = precision_score(y_test, test_preds)
    test_recall = recall_score(y_test, test_preds)
    test_f1 = f1_score(y_test, test_preds)

    print(f"\n{name} Test Accuracy: {test_accuracy}")
    print(f"{name} Test Precision: {test_precision}")
    print(f"{name} Test Recall: {test_recall}")
    print(f"{name} Test F1 Score: {test_f1}\n")



Training and evaluating MultinomialNB...
MultinomialNB Cross-Validation Accuracy: 0.7805263054817504
MultinomialNB Validation Accuracy: 0.7947976878612717
MultinomialNB Validation Precision: 0.777490297542044
MultinomialNB Validation Recall: 0.8429172510518934
MultinomialNB Validation F1 Score: 0.8088829071332435

MultinomialNB Test Accuracy: 0.8072487644151565
MultinomialNB Test Precision: 0.7852760736196319
MultinomialNB Test Recall: 0.8448844884488449
MultinomialNB Test F1 Score: 0.8139904610492847


Training and evaluating SVM...
SVM Cross-Validation Accuracy: 0.7353712275021055
SVM Validation Accuracy: 0.7557803468208093
SVM Validation Precision: 0.7394636015325671
SVM Validation Recall: 0.8120617110799438
SVM Validation F1 Score: 0.7740641711229947

SVM Test Accuracy: 0.757276221856123
SVM Test Precision: 0.7346733668341708
SVM Test Recall: 0.8041804180418042
SVM Test F1 Score: 0.7678571428571429


Training and evaluating KNN...
KNN Cross-Validation Accuracy: 0.5682773973273447


### Question 2.

The purpose of the question is to practice different machine learning algorithms for text clustering.

Please download the dataset by using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones

Apply the listed clustering methods to the dataset:
1. K-means (4 points)
2. DBSCAN (4 points)
3. Hierarchical clustering (4 points)
4. Word2Vec (4 points)
5. BERT (4 points)

### 1. K-Means Clustering
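
For orientation, K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a minimal standard-library version on toy 2-D points. The function names are hypothetical, and starting from the first `k` points is a simplification assumed here; scikit-learn's `KMeans` uses a more careful initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Simplifying assumption: the first k points serve as initial centroids
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Two well-separated toy blobs
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Because the result depends on the starting centroids, fixing the seed (for example `random_state=42` in scikit-learn) makes runs comparable.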

In [None]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Load data
data = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')

# Select numerical features and drop rows with missing values
numeric_data = data[['Price', 'Rating', 'Review Votes']]
numeric_data = numeric_data.dropna()

# Mean imputation (a no-op here, since rows with missing values were already dropped)
imputer = SimpleImputer(strategy='mean')
numeric_data_imputed = pd.DataFrame(imputer.fit_transform(numeric_data), columns=numeric_data.columns)

# Standardize numerical data
scaler = StandardScaler()
numeric_data_scaled = scaler.fit_transform(numeric_data_imputed)

# Apply K-means to numerical features
kmeans = KMeans(n_clusters=3)
cluster_assignments = kmeans.fit_predict(numeric_data_scaled)

# Attach cluster labels using the original row index kept by dropna(), so each
# label lands on the row it was computed from; rows dropped above stay NaN
data['KMeans_Cluster'] = pd.Series(cluster_assignments, index=numeric_data.index)

# Analyze Results
print(data[['Product Name', 'Brand Name', 'KMeans_Cluster']])




                                             Product Name Brand Name  \
0       "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
1       "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
2       "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
3       "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
4       "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
...                                                   ...        ...   
413835  Samsung Convoy U640 Phone for Verizon Wireless...    Samsung   
413836  Samsung Convoy U640 Phone for Verizon Wireless...    Samsung   
413837  Samsung Convoy U640 Phone for Verizon Wireless...    Samsung   
413838  Samsung Convoy U640 Phone for Verizon Wireless...    Samsung   
413839  Samsung Convoy U640 Phone for Verizon Wireless...    Samsung   

        KMeans_Cluster  
0                  1.0  
1                  1.0  
2                  1.0  
3                  1.0  
4         

### 2. DBSCAN Clustering
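
DBSCAN groups points that are densely connected and marks isolated points as noise (label -1). The sketch below is a minimal standard-library version on toy 2-D points, with hypothetical helper names. It assumes, as scikit-learn does, that `min_samples` counts the point itself.

```python
def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    px = points[i]
    return [
        j for j, q in enumerate(points)
        if sum((a - b) ** 2 for a, b in zip(px, q)) <= eps * eps
    ]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


# Two dense toy squares plus one isolated point
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11),
              (50, 50)]
toy_labels = dbscan(toy_points, eps=1.5, min_samples=3)
print(toy_labels)
```

Unlike K-means, the number of clusters is not set in advance; it follows from `eps` and `min_samples`.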

In [None]:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.utils import resample

# Load data
data = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')

# Select numerical features and drop rows with missing values
numeric_data = data[['Price', 'Rating', 'Review Votes']]
numeric_data = numeric_data.dropna()

# Downsample to 1,000 rows. Note that resample() draws with replacement by
# default, so the same row can appear more than once.
numeric_data_downsampled = resample(numeric_data, n_samples=1000, random_state=42)

# Mean imputation (a no-op here, since rows with missing values were already dropped)
imputer = SimpleImputer(strategy='mean')
numeric_data_imputed = pd.DataFrame(imputer.fit_transform(numeric_data_downsampled), columns=numeric_data_downsampled.columns)

# Standardize numerical data
scaler = StandardScaler()
numeric_data_scaled = scaler.fit_transform(numeric_data_imputed)

# Apply DBSCAN to numerical features
try:
    dbscan = DBSCAN(eps=1.0, min_samples=5, n_jobs=-1, algorithm='auto')
    data_downsampled = data.loc[numeric_data_downsampled.index].copy()  # Ensure indices match
    data_downsampled['DBSCAN_Cluster'] = dbscan.fit_predict(numeric_data_scaled)
except Exception as e:
    print(f"Error during DBSCAN: {e}")

# Analyze Results
print(data_downsampled[['Product Name', 'Brand Name', 'DBSCAN_Cluster']])


                                             Product Name  Brand Name  \
127654               BLU Life Pure Unlocked Phone (Black)         BLU   
154319  BLU Studio Selfie - Smartphone - GSM Unlocked ...         BLU   
138620      BLU Studio 5.0 C HD Unlocked Cellphone, White         BLU   
381263  Samsung Rugby 2 A847 Unlocked GSM Ruggedized P...     Samsung   
270457  Nokia Lumia 820 8GB Unlocked GSM 4G LTE Window...       Nokia   
...                                                   ...         ...   
231762  LG Nexus 5X Unlocked Smartphone - White 32GB (...          LG   
1090    5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...         NaN   
258552  NEW GOOGLE MOTOROLA 8GB MOTO G XT1032 BLACK FA...         NaN   
53735   Apple iPhone 6 64GB Factory Unlocked GSM 4G LT...         NaN   
100555  Blackberry Torch 9800 Unlocked Phone with 5 MP...  BlackBerry   

        DBSCAN_Cluster  
127654               0  
154319               1  
138620               1  
381263               0 

### 3. Hierarchical Clustering

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, fcluster

# Load data
data = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')

# Handle missing values for numerical features
numeric_data = data[['Price', 'Rating', 'Review Votes']]
numeric_data = numeric_data.dropna()

# Optionally, downsample the data to a smaller size
numeric_data_downsampled = numeric_data.sample(n=1000, random_state=42)

# Standardize numerical data
scaler = StandardScaler()
numeric_data_scaled = scaler.fit_transform(numeric_data_downsampled)

# Apply Hierarchical Clustering to numerical features
try:
    linkage_matrix = linkage(numeric_data_scaled, method='ward')
    data_downsampled = data.loc[numeric_data_downsampled.index].copy()  # Ensure indices match
    data_downsampled['Hierarchical_Cluster'] = fcluster(linkage_matrix, 3, criterion='maxclust')
except Exception as e:
    print(f"Error during Hierarchical Clustering: {e}")

# Analyze Results
print(data_downsampled[['Product Name', 'Brand Name', 'Hierarchical_Cluster']])


                                             Product Name Brand Name  \
136630  BLU Studio 5.0 C HD Smartphone - Unlocked - Ye...        BLU   
236935  LG Prime GS390 Unlocked Phone with Touchscreen...         LG   
27179            Apple iPhone 5c 32GB - Unlocked - (Blue)      Apple   
60528   Apple iPhone 6 Plus Unlocked Cellphone, 16GB, ...        NaN   
342522  Samsung Galaxy S4 I545 16GB Verizon Wireless +...    Samsung   
...                                                   ...        ...   
50016   Apple iPhone 5s T-Mobile Cellphone, 16GB, Spac...      Apple   
396444  Sony Xperia Z3+ (Z3 Plus) E6553 5.2-Inch 32GB ...       Sony   
379446  Samsung Korea SAMSUNG GALAXY S4 MINI GT-i9195 ...    Samsung   
259778  Nokia - Asha 300 Gray Touchscreen Unlocked Qua...      Nokia   
384709  Samsung SGH-T989 Galaxy S II 16GB White Androi...    Samsung   

        Hierarchical_Cluster  
136630                     2  
236935                     1  
27179                      1  
60528      

### 4. Word2Vec Clustering
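
Word2Vec yields one vector per word, while K-means needs one vector per document. A common bridge, sketched here with the standard library only, is to average the word vectors of each document. The name `mean_vector` and the tiny 2-dimensional embeddings are hypothetical; a trained model would supply real vectors.

```python
def mean_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; empty documents get a zero vector."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 2-dimensional embeddings, for illustration only
toy_vectors = {
    "great": [1.0, 0.0],
    "phone": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}

docs = [["great", "phone"], ["bad", "phone"], []]
doc_vectors = [mean_vector(tokens, toy_vectors, dim=2) for tokens in docs]
print(doc_vectors)
```

Each document then contributes exactly one row to the clustering input, regardless of how many words it contains.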

In [None]:
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Load data
data = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv', nrows=1000)

# Handle missing values for text features
text_data = data['Product Name'].fillna('') + ' ' + data['Reviews'].fillna('')

# Tokenize text data for Word2Vec (simple whitespace split)
tokenized_data = [str(text).split() for text in text_data]
word2vec_model = Word2Vec(sentences=tokenized_data, vector_size=100, window=5, min_count=1, workers=4)

# Represent each row by the mean of its word vectors, so that there is one
# embedding per document rather than one per word
embedded_data_word2vec = np.array([
    np.mean(word2vec_model.wv[tokens], axis=0) if tokens else np.zeros(word2vec_model.vector_size)
    for tokens in tokenized_data
])

# Apply K-means to the document embeddings
kmeans_word2vec = KMeans(n_clusters=3, random_state=42)
data['Word2Vec_Cluster'] = kmeans_word2vec.fit_predict(embedded_data_word2vec)

# Analyze Results
print(data[['Product Name', 'Brand Name', 'Word2Vec_Cluster']])




                                          Product Name Brand Name  \
0    "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
1    "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
2    "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
3    "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
4    "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung   
..                                                 ...        ...   
995  5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...        NaN   
996  5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...        NaN   
997  5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...        NaN   
998  5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...        NaN   
999  5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...        NaN   

     Word2Vec_Cluster  
0                   0  
1                   2  
2                   2  
3                   2  
4                   2  
..                ...  
995

### 5. BERT Clustering

In [None]:
!pip install pandas sentence-transformers



In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Load data. Only the first 1,000 rows are read (as in the Word2Vec cell),
# because encoding every review in the full file is slow.
data = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv', nrows=1000)

# Handle missing values for text features
text_data = data['Product Name'].fillna('') + ' ' + data['Reviews'].fillna('')

# Embed using a BERT-based sentence encoder; encode() takes a list of strings
# and returns a NumPy array by default
bert_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embedded_data_bert = bert_model.encode(text_data.tolist(), show_progress_bar=True)

# Apply K-means to BERT embeddings
kmeans_bert = KMeans(n_clusters=3, random_state=42)
data['BERT_Cluster'] = kmeans_bert.fit_predict(embedded_data_bert)

# Analyze Results
print(data[['Product Name', 'Brand Name', 'BERT_Cluster']])

KeyboardInterrupt: ignored

In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.
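
One possible way to put the five clusterings on a common footing before writing the comparison is an internal measure such as the silhouette coefficient, which rewards tight, well-separated clusters. The sketch below is a minimal standard-library version on toy points; the function names are chosen for illustration, and scoring singleton clusters as 0.0 is a convention assumed here. scikit-learn offers `sklearn.metrics.silhouette_score` for real data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy points: a sensible grouping versus a poor one
toy = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(toy, [0, 0, 1, 1])
bad = mean_silhouette(toy, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters, and scores near or below 0 indicate overlapping ones. With DBSCAN, noise points (label -1) would be treated as a cluster of their own by this sketch, so they are usually excluded first.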

In [None]:
# You can write your answer here. (No code needed)


