Name: Deep Shah

ID: 8836846

Subject Name: CSCN8000 Artificial Intelligence Algorithms and Mathematics 

# Part A

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score

# Loading the dataset
Lab2_dataset_data = pd.read_csv("./Lab2_dataset.csv")
Lab2_dataset_data.head()


Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\nth...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\n( see a...",0
2,3624,ham,"Subject: neon retreat\nho ho ho , we ' re arou...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\nthis deal is to ...,0


In [4]:
# Using CountVectorizer to transform the "text" feature
vectorizer = CountVectorizer(max_features=1000)  # max_features can be tuned as needed
X = vectorizer.fit_transform(Lab2_dataset_data['text'])

# Defining the target variable and splitting the dataset
y = Lab2_dataset_data['label_num']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the SVC model
svc_model = SVC()
svc_model.fit(X_train, y_train)

# Evaluating the SVC model
svc_predictions = svc_model.predict(X_test)
svc_accuracy = accuracy_score(y_test, svc_predictions)
print(f"SVC Model Accuracy: {svc_accuracy}")

# Training and evaluating the Gaussian Naive Bayes Classifier
gnb_model = GaussianNB()
gnb_model.fit(X_train.toarray(), y_train)  # GaussianNB requires a dense array, not a sparse matrix
gnb_predictions = gnb_model.predict(X_test.toarray())
gnb_accuracy = accuracy_score(y_test, gnb_predictions)
print(f"Gaussian Naive Bayes Model Accuracy: {gnb_accuracy}")

# Training and evaluating the Multinomial Naive Bayes Classifier
mnb_model = MultinomialNB()
mnb_model.fit(X_train, y_train)
mnb_predictions = mnb_model.predict(X_test)
mnb_accuracy = accuracy_score(y_test, mnb_predictions)
print(f"Multinomial Naive Bayes Model Accuracy: {mnb_accuracy}")


SVC Model Accuracy: 0.9497584541062802
Gaussian Naive Bayes Model Accuracy: 0.9420289855072463
Multinomial Naive Bayes Model Accuracy: 0.9342995169082126


Comparing the test-set accuracy of the three models, the SVC model performs best (about 0.950), followed by Gaussian Naive Bayes (about 0.942) and Multinomial Naive Bayes (about 0.934).

One likely reason the SVC model outperformed the Gaussian and Multinomial Naive Bayes models is the nature of the dataset: the SVC does not assume that features are independent, so it can capture relationships between word counts that Naive Bayes ignores. SVC works well when the classes can be separated by a clear margin.

Gaussian Naive Bayes makes a strong assumption that the features are independent given the class, and it models each feature as normally distributed within a class. Neither assumption necessarily holds in real-world data such as email text. The model is nonetheless computationally efficient and can work well with a limited amount of training data.

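Multinomial Naive Bayes is designed for text classification tasks where the features are discrete counts, which is what CountVectorizer produces, so it would normally be expected to suit this dataset. It nevertheless scored slightly below Gaussian Naive Bayes, which treats the features as continuous. The gap is small and may depend on this particular train/test split and on the choice of max_features=1000.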
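
To illustrate why the features here are discrete counts, the following is a minimal bag-of-words sketch using only the standard library. The two-message corpus and the helper `to_counts` are hypothetical and are not taken from Lab2_dataset.csv; tokens are split on whitespace, which is simpler than CountVectorizer's default tokenization (lowercasing, tokens of two or more word characters).

```python
from collections import Counter

# Hypothetical two-message corpus; not taken from Lab2_dataset.csv.
docs = ["cheap office cheap windows", "meeting notes for the office"]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({token for doc in docs for token in doc.split()})

def to_counts(doc):
    """Return one row of the document-term matrix for `doc`."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocab]

# Each row holds non-negative integer word counts, one column per vocabulary token.
matrix = [to_counts(doc) for doc in docs]

print(vocab)
print(matrix)
```

Every entry is a non-negative integer, which matches the kind of input a multinomial model is built for, whereas a Gaussian model treats the same numbers as if they were continuous.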


# Part B

In [10]:
import numpy as np

AB_NYC_2019_data = pd.read_csv("./AB_NYC_2019.csv")
AB_NYC_2019_data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [11]:
# Checking the statistics of the 'price' column to understand the distribution
print(AB_NYC_2019_data['price'].describe())


count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64


Based on this information, it's clear that the 'price' variable has a wide range of values with a significant spread. The presence of a minimum value of $0 and a maximum value of $10,000 suggests the need for further investigation, as these might be outliers or indicative of specific cases that require attention.

Additionally, the standard deviation (about 240) is larger than the mean (about 153), and the mean sits well above the median (106), so the distribution is right-skewed and likely contains extreme high values or outliers.
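
To make the suspicion about the extremes concrete, the sketch below computes the Z-scores of the reported minimum and maximum from the summary statistics printed above. It uses only the standard library and the rounded values from the describe() output; the helper name `z_score` is illustrative and not part of the notebook.

```python
# Summary statistics copied from the describe() output above.
mean_price = 152.720687
std_price = 240.154170
min_price = 0.0
max_price = 10000.0

def z_score(value, mean, std):
    """Number of standard deviations `value` lies from `mean`."""
    return (value - mean) / std

z_min = z_score(min_price, mean_price, std_price)
z_max = z_score(max_price, mean_price, std_price)

print(f"Z-score of the minimum: {z_min:.2f}")
print(f"Z-score of the maximum: {z_max:.2f}")
```

The maximum lies roughly 41 standard deviations above the mean, while the $0 minimum is within one standard deviation below it, so a Z-score threshold would flag the former but not the latter.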

In [12]:
# Defining a function to remove outliers using Z-score approach
def remove_outliers_zscore(data, threshold=3):
    z_scores = np.abs((data - data.mean()) / data.std())
    return data[z_scores < threshold]

# Defining a function to remove outliers using whiskers approach
def remove_outliers_whiskers(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_whisker = Q1 - 1.5 * IQR
    upper_whisker = Q3 + 1.5 * IQR
    return data[(data >= lower_whisker) & (data <= upper_whisker)]


In [13]:
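# Applying the Z-score approach
# (assigning the filtered Series back to the column would only set outlier
# prices to NaN, so the surviving rows are selected by index instead)
zscore_prices = remove_outliers_zscore(AB_NYC_2019_data['price'])
clean_data_zscore = AB_NYC_2019_data.loc[zscore_prices.index].copy()

# Applying the whiskers approach
whisker_prices = remove_outliers_whiskers(AB_NYC_2019_data['price'])
clean_data_whiskers = AB_NYC_2019_data.loc[whisker_prices.index].copy()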

In [14]:
# Printing the statistics of 'price' after removing outliers, for comparison with the original describe() output
print("Statistics after Z-score approach:")
print(clean_data_zscore['price'].describe())

print("\nStatistics after whiskers approach:")
print(clean_data_whiskers['price'].describe())

Statistics after Z-score approach:
count    48507.000000
mean       138.746903
std        107.558233
min          0.000000
25%         69.000000
50%        105.000000
75%        175.000000
max        860.000000
Name: price, dtype: float64

Statistics after whiskers approach:
count    45923.000000
mean       119.970320
std         68.150148
min          0.000000
25%         65.000000
50%        100.000000
75%        159.000000
max        334.000000
Name: price, dtype: float64


Analysis based on the above results:

1) The mean price decreased after outlier removal in both cases, from 152.72 to 138.75 (Z-score) and 119.97 (whiskers). This is expected, as outliers pull the mean towards extreme values.

2) The standard deviation decreased significantly after outlier removal in both cases, from 240.15 to 107.56 (Z-score) and 68.15 (whiskers). This indicates that the remaining prices are more tightly concentrated around the mean.

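3) The percentiles remain relatively stable under the Z-score approach (only the median moves, from 106 to 105), while the whiskers approach shifts them lower (25%: 69 to 65, 50%: 106 to 100, 75%: 175 to 159) because it removes more of the upper tail. In both cases the central tendency of the data changed far less than the mean and standard deviation.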

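4) The maximum value decreased after outlier removal, from 10,000 to 860 (Z-score) and 334 (whiskers), which indicates that extreme high values were identified as outliers and removed. The minimum stayed at 0 in both cases, so neither approach treats the $0 listings as outliers.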
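
The minimum and maximum values in the two cleaned summaries follow from each method's cutoffs. As a rough check, the sketch below recomputes those cutoffs from the original describe() output (rounded values, standard library only) and counts how many rows each method dropped. The variable names are illustrative and not part of the notebook.

```python
# Values copied from the describe() output of the original 'price' column.
count_before = 48895
mean_price, std_price = 152.720687, 240.154170
q1, q3 = 69.0, 175.0

# Z-score approach: keep prices within 3 standard deviations of the mean.
z_lower = mean_price - 3 * std_price
z_upper = mean_price + 3 * std_price

# Whiskers approach: keep prices within 1.5 * IQR of the quartiles.
iqr = q3 - q1
w_lower = q1 - 1.5 * iqr
w_upper = q3 + 1.5 * iqr

# Rows removed, using the counts reported after each approach.
removed_zscore = count_before - 48507
removed_whiskers = count_before - 45923

print(f"Z-score bounds: [{z_lower:.2f}, {z_upper:.2f}], removed {removed_zscore}")
print(f"Whisker bounds: [{w_lower:.2f}, {w_upper:.2f}], removed {removed_whiskers}")
```

Both lower bounds are negative, so a price of 0 can never be excluded by either method. The upper whisker of 334 matches the maximum reported after the whiskers approach, and the Z-score upper bound of about 873 is consistent with the reported maximum of 860. The whiskers approach is the stricter of the two here, dropping 2,972 rows against 388 for the Z-score approach.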