# Lab 2 – Probability and Statistics
## Part A

#### Preprocessing

In [36]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [37]:
# Loading the dataset
df = pd.read_csv(r'C:\Users\dharm\OneDrive\Desktop\Fall2023\practical_labs\datasets\Lab_2\Lab2_dataset.csv', encoding='utf-8')

In [38]:
# Extracting the necessary columns
X = df['text']
y = df['label_num']

In [39]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Transform the text feature to a vector representation
X_vectorized = vectorizer.fit_transform(X)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)  # You can adjust the test_size and random_state according to your requirements

# Check the shapes of the train and test data
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (4136, 50447)
Shape of X_test: (1035, 50447)
Shape of y_train: (4136,)
Shape of y_test: (1035,)


#### Model Training and Evaluation

In [40]:
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [41]:
# Initializing the models
svc_model = SVC()
gnb_model = GaussianNB()
mnb_model = MultinomialNB()

In [42]:
# Training the models
svc_model.fit(X_train, y_train)
gnb_model.fit(X_train.toarray(), y_train)
mnb_model.fit(X_train, y_train)

In [43]:
# Making predictions
svc_pred = svc_model.predict(X_test)
gnb_pred = gnb_model.predict(X_test.toarray())
mnb_pred = mnb_model.predict(X_test)

In [44]:
# Evaluating the models
print("SVC Accuracy:", accuracy_score(y_test, svc_pred))
print("Gaussian Naive Bayes Accuracy:", accuracy_score(y_test, gnb_pred))
print("Multinomial Naive Bayes Accuracy:", accuracy_score(y_test, mnb_pred))

# Print classification reports for more detailed evaluation
print(f"\nSVC Classification Report:\n{classification_report(y_test, svc_pred)}")
print(f"\nGaussian Naive Bayes Classification Report:\n{classification_report(y_test, gnb_pred)}")
print(f"\nMultinomial Naive Bayes Classification Report:\nclassification_report(y_test, mnb_pred)")

SVC Accuracy: 0.9652173913043478
Gaussian Naive Bayes Accuracy: 0.9545893719806763
Multinomial Naive Bayes Accuracy: 0.978743961352657

SVC Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.97      0.98       742
           1       0.93      0.95      0.94       293

    accuracy                           0.97      1035
   macro avg       0.95      0.96      0.96      1035
weighted avg       0.97      0.97      0.97      1035


Gaussian Naive Bayes Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.99      0.97       742
           1       0.96      0.87      0.92       293

    accuracy                           0.95      1035
   macro avg       0.96      0.93      0.94      1035
weighted avg       0.95      0.95      0.95      1035


Multinomial Naive Bayes Classification Report:
classification_report(y_test, mnb_pred)


The differences in performance are mainly attributed to the underlying assumptions of each model.<br>
SVC doesn't make the same assumptions as the Naive Bayes models, and it can capture complex relationships; this makes it suitable for various types of data.<br>
There can be limited performance on more complex datasets for Naive Bayes models as it makes strong assumptions about feature independence and data distribution where these assumptions are not met.<br>
Hence, the nature of the data and the specific task at hand should be taken care of when chosing the model.

## Part B

In [45]:
import numpy as np

In [46]:
# Loading the dataset
df = pd.read_csv(r'C:\Users\dharm\OneDrive\Desktop\Fall2023\practical_labs\datasets\Lab_2\AB_NYC_2019.csv', encoding='utf-8')

In [47]:
# Calculating Z-score
df['z_score'] = (df['price'] - df['price'].mean()) / df['price'].std()

In [48]:
# Using Z-score approach
df_z_score = df[(np.abs(df['z_score']) < 3)]

In [49]:
# Using whiskers approach
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
df_whiskers = df[(df['price'] > lower_whisker) & (df['price'] < upper_whisker)]

In [50]:
# Printing the first 5 rows of the datasets
print(df_z_score.head())
print(df_whiskers.head())

     id                                              name  host_id  \
0  2539                Clean & quiet apt home by the park     2787   
1  2595                             Skylit Midtown Castle     2845   
2  3647               THE VILLAGE OF HARLEM....NEW YORK !     4632   
3  3831                   Cozy Entire Floor of Brownstone     4869   
4  5022  Entire Apt: Spacious Studio/Loft by central park     7192   

     host_name neighbourhood_group neighbourhood  latitude  longitude  \
0         John            Brooklyn    Kensington  40.64749  -73.97237   
1     Jennifer           Manhattan       Midtown  40.75362  -73.98377   
2    Elisabeth           Manhattan        Harlem  40.80902  -73.94190   
3  LisaRoxanne            Brooklyn  Clinton Hill  40.68514  -73.95976   
4        Laura           Manhattan   East Harlem  40.79851  -73.94399   

         room_type  price  minimum_nights  number_of_reviews last_review  \
0     Private room    149               1                  9  20

In [51]:
# Printing the results
print("Shape of the original dataset:", df.shape)
print("Shape of the dataset using Z-score approach:", df_z_score.shape)
print("Shape of the dataset using whiskers approach:", df_whiskers.shape)

print("\nMean of the original dataset:", df['price'].mean())
print("Mean of the dataset using Z-score approach:", df_z_score['price'].mean())
print("Mean of the dataset using whiskers approach:", df_whiskers['price'].mean())

print("\nStandard deviation of the original dataset:", df['price'].std())
print("Standard deviation of the dataset using Z-score approach:", df_z_score['price'].std())
print("Standard deviation of the dataset using whiskers approach:", df_whiskers['price'].std())

print("\nMinimum value of the original dataset:", df['price'].min())
print("Minimum value of the dataset using Z-score approach:", df_z_score['price'].min())
print("Minimum value of the dataset using whiskers approach:", df_whiskers['price'].min())

print("\nMaximum value of the original dataset:", df['price'].max())
print("Maximum value of the dataset using Z-score approach:", df_z_score['price'].max())
print("Maximum value of the dataset using whiskers approach:", df_whiskers['price'].max())

Shape of the original dataset: (48895, 17)
Shape of the dataset using Z-score approach: (48507, 17)
Shape of the dataset using whiskers approach: (45918, 17)

Mean of the original dataset: 152.7206871868289
Mean of the dataset using Z-score approach: 138.74690250891624
Mean of the dataset using whiskers approach: 119.94701424278061

Standard deviation of the original dataset: 240.15416974718758
Standard deviation of the dataset using Z-score approach: 107.5582327130842
Standard deviation of the dataset using whiskers approach: 68.11724909788296

Minimum value of the original dataset: 0
Minimum value of the dataset using Z-score approach: 0
Minimum value of the dataset using whiskers approach: 0

Maximum value of the original dataset: 10000
Maximum value of the dataset using Z-score approach: 860
Maximum value of the dataset using whiskers approach: 333


In [52]:
# Saving the cleaned datasets
df_z_score.to_csv('cleaned_dataset_z_score.csv', index=False)
df_whiskers.to_csv('cleaned_dataset_whiskers.csv', index=False)

### For the whiskers approach, the mean and standard deviation of the dataset look more consistent; this indicates a more stable dataset compared to the Z-score approach. Moreover, there is a significant reduction in the number of rows compared to the Z-score approach. Hence, I think whiskers approach is better to remove the outliers in this case.