Lab 2 - Probability and Statistics


Part A  

Use the **Lab2_dataset.csv** provided.

- Load the dataset
- Use the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class in sklearn to transform the "text" feature into a vector representation of a predetermined size.
- Split the dataset into training and testing
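
To make "a vector representation of a predetermined size" concrete, here is a minimal pure-Python sketch of count vectorization with a capped vocabulary. It is an illustration, not sklearn's implementation: the helper names `fit_vocabulary` and `transform_counts`, the toy sentences, and the alphabetical tie-breaking rule are all assumptions made for this example. In sklearn itself, the `max_features` parameter of `CountVectorizer` plays the role of the cap.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def fit_vocabulary(texts, max_features):
    """Keep the max_features most frequent tokens across the corpus (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    # Rank by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    # Column indices are assigned in alphabetical order of the kept tokens.
    return {token: i for i, token in enumerate(sorted(t for t, _ in ranked))}


def transform_counts(texts, vocabulary):
    """Turn each text into a fixed-length list of token counts (hypothetical helper)."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for token in TOKEN_PATTERN.findall(text.lower()):
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


toy_texts = ["free money now", "meeting at noon", "free free offer now"]
vocab = fit_vocabulary(toy_texts, max_features=3)
X_demo = transform_counts(toy_texts, vocab)
print(vocab)   # {'at': 0, 'free': 1, 'now': 2}
print(X_demo)  # [[0, 1, 1], [1, 0, 0], [0, 2, 1]]
```

Every row has exactly three entries regardless of how long the text is, which is what fixes the feature dimension ahead of training.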

In [3]:
import pandas as pd

from scipy.stats import zscore

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC


In [6]:
# Load the dataset
df = pd.read_csv('practical_labs/datasets/Lab_2/Lab2_dataset.csv')

# Vectorize the "text" column into token counts.
# Note: with no max_features argument, the vector size equals the full vocabulary size.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text'])

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, df['label_num'], test_size=0.2, random_state=42)

Model Training and Evaluation

- Train the sklearn SVC model on the training dataset and evaluate it on the test set.

- Also train and evaluate the Gaussian and Multinomial Naive Bayes classifiers.

- Compare the performance of all three models and comment on the reasons behind the differences between them.


In [8]:
svc_model = SVC()
svc_model.fit(X_train, y_train)

svc_predictions = svc_model.predict(X_test)
svc_accuracy = accuracy_score(y_test, svc_predictions)

print(f"SVC Accuracy: {svc_accuracy*100:.2f}%")
print(classification_report(y_test, svc_predictions))

SVC Accuracy: 96.52%
              precision    recall  f1-score   support

           0       0.98      0.97      0.98       742
           1       0.93      0.95      0.94       293

    accuracy                           0.97      1035
   macro avg       0.95      0.96      0.96      1035
weighted avg       0.97      0.97      0.97      1035
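
The per-class columns of a report like the one above can be reproduced by hand from true and predicted labels. The sketch below is a minimal pure-Python illustration; the function name `precision_recall_f1` and the toy label lists are hypothetical and are not taken from the lab dataset.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: class 1 has 2 true positives, 1 false positive, 1 false negative.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 0]
p_demo, r_demo, f1_demo = precision_recall_f1(y_true_demo, y_pred_demo, positive=1)
print(f"precision={p_demo:.2f} recall={r_demo:.2f} f1={f1_demo:.2f}")
```

Precision answers "of the rows predicted as this class, how many were right", while recall answers "of the rows that truly belong to this class, how many were found".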



In [11]:
# Gaussian Naive Bayes
gnb_model = GaussianNB()
gnb_model.fit(X_train.toarray(), y_train)  # Convert to dense array for GaussianNB

gnb_predictions = gnb_model.predict(X_test.toarray())
gnb_accuracy = accuracy_score(y_test, gnb_predictions)

print(f"GaussianNB Accuracy: {gnb_accuracy*100:.2f}%")
print(classification_report(y_test, gnb_predictions))

# Multinomial Naive Bayes
mnb_model = MultinomialNB()
mnb_model.fit(X_train, y_train)

mnb_predictions = mnb_model.predict(X_test)
mnb_accuracy = accuracy_score(y_test, mnb_predictions)

print(f"MultinomialNB Accuracy: {mnb_accuracy*100:.2f}%")
print(classification_report(y_test, mnb_predictions))

GaussianNB Accuracy: 95.46%
              precision    recall  f1-score   support

           0       0.95      0.99      0.97       742
           1       0.96      0.87      0.92       293

    accuracy                           0.95      1035
   macro avg       0.96      0.93      0.94      1035
weighted avg       0.95      0.95      0.95      1035

MultinomialNB Accuracy: 97.87%
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       742
           1       0.96      0.96      0.96       293

    accuracy                           0.98      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.98      0.98      0.98      1035



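Multinomial Naive Bayes performed best (97.87% accuracy), followed by SVC (96.52%) and Gaussian Naive Bayes (95.46%). MultinomialNB models word counts directly, which matches the output of CountVectorizer. GaussianNB assumes each feature is normally distributed within a class, a poor fit for sparse, mostly zero count features; this is consistent with its lower recall on class 1 (0.87). The SVC was run with its default settings (RBF kernel) and might close the gap with tuning.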
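
One way to see why a multinomial model suits word counts is to write the classifier out in a few lines. The following is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1`); the function names `train_multinomial_nb` and `predict_multinomial_nb` and the toy count matrix are hypothetical, and this is not sklearn's implementation.

```python
import math


def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class log token probabilities."""
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each token within the class, plus the smoothing term.
        totals = [sum(column) + alpha for column in zip(*rows)]
        denominator = sum(totals)
        log_likelihood[c] = [math.log(t / denominator) for t in totals]
    return log_prior, log_likelihood


def predict_multinomial_nb(x, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy count matrix; columns are the tokens [free, meeting, now].
X_toy = [[2, 0, 1], [1, 0, 1], [0, 2, 0], [0, 1, 1]]
y_toy = [1, 1, 0, 0]
priors, likelihoods = train_multinomial_nb(X_toy, y_toy)
print(predict_multinomial_nb([1, 0, 0], priors, likelihoods))  # 1
print(predict_multinomial_nb([0, 1, 0], priors, likelihoods))  # 0
```

Each token count multiplies the log probability of that token, so a zero count simply contributes nothing. A Gaussian model, by contrast, has to fit a mean and variance to columns that are zero in most rows.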

Part B


Use the **AB_NYC_2019.csv** dataset for this part.

Tasks

- Remove outliers based on price per night for a given apartment/home.

- Compare the Z-score approach and the whiskers (IQR) approach, and determine which is better at removing the outliers in this case.

In [8]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'D:\AIAlgorithm\Fall2023\practical_labs\datasets\Lab_2\AB_NYC_2019.csv')

# Calculate Z-scores for the 'price' column
z_scores = zscore(df['price'])

# Get boolean array indicating the position of outliers
outliers_zscore = (z_scores > 3) | (z_scores < -3)

# Remove outliers
df_zscore_removed = df[~outliers_zscore]

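# Compute the interquartile range (IQR) of the 'price' column
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

# Whisker bounds. Note: a multiplier of 3 is used here;
# the conventional boxplot whisker uses 1.5 * IQR, which would remove more rows.
lower_whisker = Q1 - 3 * IQR
upper_whisker = Q3 + 3 * IQR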

outliers_whiskers = (df['price'] < lower_whisker) | (df['price'] > upper_whisker)

# Remove outliers
df_whiskers_removed = df[~outliers_whiskers]

print(f"Original dataset shape: {df.shape}")
print(f"Dataset shape after Z-score outliers removal: {df_zscore_removed.shape}")
print(f"Dataset shape after whiskers outliers removal: {df_whiskers_removed.shape}")

# You can also visualize using boxplots or histograms to visually assess the removal.


Original dataset shape: (48895, 16)
Dataset shape after Z-score outliers removal: (48507, 16)
Dataset shape after whiskers outliers removal: (47567, 16)


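The whiskers method removed more entries (1,328 rows) than the Z-score method (388 rows).

This suggests that many prices fall outside the whiskers yet remain within 3 standard deviations of the mean. A likely reason is that the most extreme prices inflate the mean and standard deviation themselves, which loosens the Z-score threshold, whereas the quartiles used by the whiskers are barely affected by them. On that basis, the whiskers approach appears to be the stricter filter for this dataset.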
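
The effect described above can be reproduced on a small sample. The sketch below is a minimal pure-Python illustration using the standard `statistics` module; the helper names `zscore_outliers` and `whisker_outliers` and the `toy_prices` list are hypothetical and are not drawn from AB_NYC_2019.csv. It uses the population standard deviation (the default behavior of `scipy.stats.zscore`) and linearly interpolated quartiles with the same 3 × IQR multiplier as the cell above.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    return [v for v in values if abs((v - mean) / std) > threshold]


def whisker_outliers(values, k=3.0):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Ten ordinary prices plus two extreme ones.
toy_prices = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 600, 5000]
print(zscore_outliers(toy_prices))   # [5000]
print(whisker_outliers(toy_prices))  # [600, 5000]
```

The single value of 5000 raises the standard deviation to roughly 1348, so 600 ends up with a Z-score close to zero and is kept. The quartiles (87.5 and 142.5) ignore that extreme value, so the upper whisker sits at 307.5 and both unusual prices are flagged.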