### READ ME

Use the code blocks below to answer each question. Print only the output required for each question. Do not edit the comments at the top of each code cell; otherwise, the auto-grader may misinterpret your results. See Question 0 as an example of how to complete a task (leave it in your notebook; don't delete it):

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
# Question 1: Import the data here and perform any data cleaning
# steps that you feel are necessary in this code cell. Keep all 
# cleaning steps here in the same code cell. First, check for missing
# values and print out the totals. If you have missing values, then
# either replace them (.fillna()) with a theoretically meaningful 
# value (e.g. 'other' or 0) or delete the column. You may delete rows
# as long as you can maintain a final row count >= 500.
# 
# Next, check for label skewness and print out the skewness scores. 
# Make an adjustment to the label if the skewness is > 1 or < -1. If 
# you cannot get the label between -1 to 1 after making an adjustment, 
# that is okay for now. It just means that, in practice, you would switch
# to using a Decision Trees regression model rather than MLR. 

import pandas as pd
import numpy as np

#import data set
df = pd.read_csv('tw_tweets_users_media_places.csv')

#check for missing values
print(df.isnull().sum())
print('-------------------------------------------------')

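# fill in missing values in the location column
# (assign the result back; calling fillna(inplace=True) on a selected column may not update df in newer pandas)
df['location'] = df['location'].fillna('Other')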

#check for label skewness
label = 'likes'
print('skewness: ' + str(df[label].skew()))
print('-------------------------------------------------')

# transform the label with log1p, i.e. ln(1 + x), so that zero counts stay defined
df[label] = np.log1p(df[label])
print('skewness after log: ' + str(df[label].skew()))
print('-------------------------------------------------')

df.head()

tweet_id                      0
text                          0
context_annotations_count     0
count_annotations             0
count_cashtags                0
count_hashtags                0
count_mentions                0
count_urls                    0
created_at_tweet              0
lang                          0
likes                         0
quotes                        0
referenced_tweet_count        0
replies                       0
reply_settings                0
retweets                      0
source                        0
terms                         0
username                      0
created_at_author             0
followers_count               0
following_count               0
tweet_count                   0
listed_count                  0
location                     42
protected                     0
verified                      0
media_type                    0
height                        0
width                         0
preview_image_url             0
country 

Unnamed: 0,tweet_id,text,context_annotations_count,count_annotations,count_cashtags,count_hashtags,count_mentions,count_urls,created_at_tweet,lang,...,location,protected,verified,media_type,height,width,preview_image_url,country,name_place,place_type
0,1440484799970304000,This was my grandson this morning (w/autism)! ...,1,0.0,0.0,0.0,0.0,1.0,2021-09-22T01:15:13.000Z,en,...,"Victoria, BC",False,False,photo,405,813,https://pbs.twimg.com/media/E_2hSs4UcAAIOK5.jpg,Canada,Langford,city
1,1439618825171963904,Wow!! Been into #York for the first time since...,2,2.0,0.0,3.0,0.0,1.0,2021-09-19T15:54:09.000Z,en,...,"Hessay, York",False,False,photo,2048,1536,https://pbs.twimg.com/media/E_qNsE1X0AQmoK_.jpg,United Kingdom,Hessay,city
2,1248872872837332992,Sad number of ppl who lost life due to covid-1...,3,0.0,0.0,0.0,0.0,1.0,2020-04-11T07:17:50.000Z,en,...,"Maidstone, South East",False,False,photo,288,278,https://pbs.twimg.com/media/EVTjQcoXsAAlrfq.jpg,United Kingdom,Maidstone,city
3,1250729294051053568,Webinar now available‘Staying healthy at home ...,1,2.0,0.0,3.0,0.0,2.0,2020-04-16T10:14:35.000Z,en,...,"Maidstone, South East",False,False,photo,2048,2048,https://pbs.twimg.com/media/EVt7pYTXkAMGzxj.jpg,United Kingdom,Maidstone,city
4,1249612131433095168,Webinar now available‘Staying healthy at home ...,1,2.0,0.0,3.0,0.0,2.0,2020-04-13T08:15:23.000Z,en,...,"Maidstone, South East",False,False,photo,2048,2048,https://pbs.twimg.com/media/EVeDlp7X0AMuN6X.jpg,United Kingdom,Maidstone,city
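
The skewness check in Question 1 can be tried in isolation. The snippet below is a minimal sketch on made-up data: the `likes` series is a synthetic stand-in (an assumption, not the scraped file), and it only illustrates how `.skew()` changes after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed counts standing in for a label such as 'likes'.
rng = np.random.default_rng(0)
likes = pd.Series(rng.lognormal(mean=1.0, sigma=1.2, size=500)).round()

raw_skew = likes.skew()
# log1p computes ln(1 + x), so a count of zero maps to 0 rather than -inf.
log_skew = np.log1p(likes).skew()

print(f"raw skewness:   {raw_skew:.3f}")
print(f"log1p skewness: {log_skew:.3f}")
print("within [-1, 1] after transform:", abs(log_skew) <= 1)
```

If the transformed score still falls outside -1 to 1, the instructions above already note that a tree-based regression would be the usual fallback.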


## **MLR Model**

In [4]:
# Question 2: Build an MLR model based on one of the labels you 
# identified and collected during the Web Scraping Project. Keep 
# all of the code contained here in this code block. You should 
# have at least one or more features that need to be dummy coded. 
# Scale the data using a MinMax normalization. Do not include any
# unstructured features such as tweet text, product description, 
# or image URLs.
# 
# After you have build the first MLR model, trim all of the 
# insignificant features from the model so that only those 
# with p-values < 0.20 are included in your model. Yes, that is 
# higher than the typical 0.05 cutoff. However, if you have only
# 500 records, it is not uncommon to accept higher p-values. You
# do not need to split the data for this model if you are using the
# statsmodels.api package as we do in the book. Although you
# normally would in practice.
import statsmodels.api as sm
from sklearn import preprocessing

df_dummies = df.copy()

#drop tweet text, image urls
df_dummies.drop(columns=['text', 'created_at_tweet', 'preview_image_url', 'username', 'created_at_author'], inplace=True)

#drop alternative labels
df_dummies.drop(columns=['retweets', 'replies'], inplace=True)

#make dummy codes
for col in df_dummies:
  if not pd.api.types.is_numeric_dtype(df_dummies[col]):
    df_dummies = pd.get_dummies(df_dummies, columns=[col], drop_first=True, prefix="", prefix_sep="")
  

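#normalize the data to the 0-1 range (MinMax scaling)
# NOTE: df_minmax is not referenced again; the models below are fit on the unscaled df_dummies
df_minmax = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(df_dummies), columns=df_dummies.columns)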

#run the MLR
def mlr():
  # fit OLS on the current df_dummies, adding an intercept column
  y = df_dummies[label]
  X = df_dummies.drop(columns=label).assign(const=1)
  return sm.OLS(y, X.astype(float)).fit()

results = mlr()
# print(results.summary())

print(f'starting R Squared {results.rsquared}')
while True:
    # rank feature p-values from highest to lowest; the intercept is not a
    # column of df_dummies, so it is never a candidate for dropping
    pvalues = results.pvalues.drop('const').sort_values(ascending=False)
    if not pvalues.iloc[0] > 0.20:
        break
    # get the highest p-value column
    highestCol = pvalues.index[0]
    print(f'Dropping {highestCol} with a p-value of {pvalues.iloc[0]}.')
    df_dummies.drop(columns=[highestCol], inplace=True)
    # re-run the model
    results = mlr()
print("----------------------------------------------")
print(f'Final R squared {results.rsquared}')
df_dummies.head()

starting R Squared 0.12407988932804592
Dropping Charleston with a p-value of 0.9903293822261563.
Dropping Charleston, SC with a p-value of 0.990329382245585.
Dropping Walsall, England with a p-value of 0.8480140284952782.
Dropping Willenhall with a p-value of 0.8480140284665199.
Dropping Little Rock with a p-value of 0.7933474343543606.
Dropping covid%20"sensory overload" with a p-value of 0.7933474343457017.
Dropping Greenock, Scotland with a p-value of 0.5433148071078586.
Dropping Greenock with a p-value of 0.5433148070993097.
Dropping corona%20autism with a p-value of 0.41458932708690177.
Dropping tweet_count with a p-value of 0.3985844318689902.
Dropping Garston with a p-value of 0.40847484599591644.
Dropping Nottingham, England with a p-value of 0.40847484687835434.
----------------------------------------------
Final R squared 0.12294934900754606


Unnamed: 0,tweet_id,context_annotations_count,count_annotations,count_cashtags,count_hashtags,count_mentions,count_urls,likes,quotes,referenced_tweet_count,...,Yarm,Yonkers,York Hospital,Zuienkerke,İstanbul,トロン温泉 稲荷湯,city,country,neighborhood,poi
0,1440484799970304000,1,0.0,0.0,0.0,0.0,1.0,2.70805,0,0,...,0,0,0,0,0,0,1,0,0,0
1,1439618825171963904,2,2.0,0.0,3.0,0.0,1.0,2.079442,0,0,...,0,0,0,0,0,0,1,0,0,0
2,1248872872837332992,3,0.0,0.0,0.0,0.0,1.0,3.912023,1,0,...,0,0,0,0,0,0,1,0,0,0
3,1250729294051053568,1,2.0,0.0,3.0,0.0,2.0,1.386294,0,0,...,0,0,0,0,0,0,1,0,0,0
4,1249612131433095168,1,2.0,0.0,3.0,0.0,2.0,2.772589,2,0,...,0,0,0,0,0,0,1,0,0,0
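
Question 2 asks for dummy coding followed by MinMax normalization. The sketch below shows both steps on a tiny made-up frame using pandas only. The column values are hypothetical, and keeping the column name as a dummy prefix is an assumption made here (the cell above uses an empty prefix) so that two categorical columns with a shared level cannot produce identically named columns.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "followers_count": [10, 250, 40, 1000],
    "media_type": ["photo", "video", "photo", "animated_gif"],
    "place_type": ["city", "city", "poi", "country"],
})

# Dummy-code the categorical columns; drop_first removes one level per column
# so the remaining dummies are not perfectly collinear with an intercept.
encoded = pd.get_dummies(
    toy, columns=["media_type", "place_type"], drop_first=True, dtype=float
)

# MinMax normalization per column: (x - min) / (max - min).
col_min, col_max = encoded.min(), encoded.max()
scaled = (encoded - col_min) / (col_max - col_min)

print(scaled.round(3))
```

A column with a single constant value would divide by zero in this formula, so such columns would need to be dropped or handled before scaling.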


## **Decision Tree Model**

In [8]:
# Question 3: Build a Decision Tree model based on one of the categorical
# labels you identified and collected during the Web Scraping Project. Keep 
# all of the code contained here in this code block. 

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

df_decision_tree = df_dummies.copy()

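#set y and X
y = df['protected']                       #Not sure if this is the right thing to be predicting
# drop the MLR label and the target itself so the tree cannot read the answer from X
X = df_decision_tree.drop(columns=[label, 'protected'], errors='ignore')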

#train the decision tree model
clf = DecisionTreeClassifier()
clf = clf.fit(X, y)


#make a prediction using the decision tree model. print the accuracy
# (measured on the same rows used for training, so it is not an out-of-sample estimate)
y_pred = clf.predict(X)
print(f'Accuracy:\t{metrics.accuracy_score(y, y_pred)}')
pd.DataFrame({'Actual': y, 'Predicted': y_pred}).sort_values(by=['Actual', 'Predicted'], ascending=[False, True]).head(5)


Accuracy:	1.0


Unnamed: 0,Actual,Predicted
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
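
An accuracy of 1.0 here is measured on the training rows, and the sorted output lists only `False` rows first, which suggests the label may contain no `True` cases at all. A hold-out split gives a more informative number. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions, not the tweet data) using the `train_test_split` import that the cell already includes.

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 400 rows, 5 features, binary label driven by two features plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 5))
noise = rng.normal(scale=0.5, size=400)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

# Hold out 30% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = metrics.accuracy_score(y_train, tree.predict(X_train))
test_acc = metrics.accuracy_score(y_test, tree.predict(X_test))
print(f"Train accuracy:\t{train_acc}")
print(f"Test accuracy:\t{test_acc}")
```

An unpruned tree typically memorizes its training rows, so the gap between the two numbers is the part worth reporting.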


In [6]:
# Question 4: Create a visualization of the Decision Tree model so that
# you can interpret the results

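#visualize the tree
# NOTE: pydotplus needs the GraphViz system executables (e.g. `dot`) on PATH,
# not only the Python packages
from sklearn.tree import export_graphviz
from IPython.display import Image  
import pydotplus
from io import StringIO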

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True, special_characters=True, feature_names = X.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('protected.png')
Image(graph.create_png())

InvocationException: GraphViz's executables not found
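
When the GraphViz executables are not installed, the tree can still be inspected as plain text. The sketch below uses scikit-learn's `export_text` on the bundled iris data (a stand-in, not the tweet data). Applying it to the notebook's model would presumably look like `export_text(clf, feature_names=list(X.columns))`, which is an untested assumption here.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small demonstration tree on a bundled dataset; max_depth keeps the printout short.
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text renders the split rules as indented text; no GraphViz install is needed.
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one split, and the leaf lines show the predicted class, which is usually enough to interpret a shallow tree.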

## **Cluster Model**

In [None]:
# Question 5: Build a cluster model using either K-means or Agglomerative
# clustering based on which you think is best for your data type. Remember,
# k-means is best when the data types and scales are uniform and
# agglomerative is best when both are mixed. If you used k-means, then
# calculate all three metrics for determining the optimal number of clusters.
# If you use agglomerative clustering, use the Gower matrix as the distance
# measure. With either clustering algorithm, print out a value_counts() of the 
# number of cases in each cluster. Keep all code in this code cell. 

!pip install gower
import gower
from sklearn.cluster import AgglomerativeClustering

distance_matrix = gower.gower_matrix(df_decision_tree)
# n_clusters defaults to 2; in scikit-learn >= 1.2 the `affinity` argument is named `metric`
agg = AgglomerativeClustering(affinity='precomputed', linkage='average').fit(distance_matrix)

#make a cluster column
df_wcluster = df_decision_tree.copy()
df_wcluster['cluster'] = agg.labels_
print(df_wcluster.cluster.value_counts())
df_wcluster.head()

0    533
1      1
Name: cluster, dtype: int64


Unnamed: 0,tweet_id,context_annotations_count,count_annotations,count_cashtags,count_hashtags,count_mentions,count_urls,likes,quotes,referenced_tweet_count,...,Yonkers,York Hospital,Zuienkerke,İstanbul,トロン温泉 稲荷湯,city,country,neighborhood,poi,cluster
0,1440484799970304000,1,0.0,0.0,0.0,0.0,1.0,2.70805,0,0,...,0,0,0,0,0,1,0,0,0,0
1,1439618825171963904,2,2.0,0.0,3.0,0.0,1.0,2.079442,0,0,...,0,0,0,0,0,1,0,0,0,0
2,1248872872837332992,3,0.0,0.0,0.0,0.0,1.0,3.912023,1,0,...,0,0,0,0,0,1,0,0,0,0
3,1250729294051053568,1,2.0,0.0,3.0,0.0,2.0,1.386294,0,0,...,0,0,0,0,0,1,0,0,0,0
4,1249612131433095168,1,2.0,0.0,3.0,0.0,2.0,2.772589,2,0,...,0,0,0,0,0,1,0,0,0,0
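
A 533-to-1 split means average linkage has mostly isolated a single outlying row. One optional way to explore other cluster counts is to cut the same hierarchy at several values of k and compare silhouette scores. The sketch below is an illustration under stated assumptions: it uses SciPy's hierarchical clustering in place of `AgglomerativeClustering`, synthetic points in place of the tweet data, and a hand-rolled Gower-style distance for numeric columns in place of the `gower` package.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

# Three synthetic, well-separated groups of 30 two-dimensional points each.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])

# Gower-style distance for numeric columns: mean of range-normalized absolute differences.
ranges = points.max(axis=0) - points.min(axis=0)
dist = (np.abs(points[:, None, :] - points[None, :, :]) / ranges).mean(axis=2)

# Average-linkage hierarchy built from the condensed form of the distance matrix.
tree = linkage(squareform(dist, checks=False), method="average")

# Cut the hierarchy at several cluster counts and score each cut.
scores = {}
for k in range(2, 7):
    labels = fcluster(tree, t=k, criterion="maxclust")
    scores[k] = silhouette_score(dist, labels, metric="precomputed")
    print(k, np.bincount(labels)[1:], round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

Printing the cluster sizes next to each score makes degenerate cuts, such as one cluster holding a single row, easy to spot.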
