|  Column name  |  Description  |
| ----- | ------- |
| Num_posts        | Total number of posts the user has made |
| Num_following    | Number of accounts the user follows |
| Num_followers    | Number of followers |
| Biography_length | Length (number of characters) of the user's biography |
| Picture_availability | 0 if the user has no profile picture, 1 if they have one |
| Link_availability| 0 if the user has no external URL, 1 if they have one |
| Average_caption_length | Average number of characters in the captions of the user's media |
| Caption_zero     | Fraction (0.0 to 1.0) of captions with near-zero length (<= 3 characters) |
| Non_image_percentage | Fraction (0.0 to 1.0) of non-image media. An Instagram post has one of three media types: image, video, or carousel |
| Engagement_rate_like | Engagement rate (ER) from likes, commonly defined as (number of likes) / (number of media) / (number of followers) |
| Engagement_rate_comment | Same as Engagement_rate_like, but computed from comments |
| Location_tag_percentage | Fraction (0.0 to 1.0) of posts tagged with a location |
| Average_hashtag_count   | Average number of hashtags used in a post |
| Promotional_keywords | Average use of promotional keywords in hashtags, namely regrann, contest, repost, giveaway, mention, share, give away, quiz |
| Followers_keywords | Average use of follower-hunting keywords in hashtags, namely follow, like, folback, follback, f4f |
| Cosine_similarity  | Average cosine similarity across all pairs of the user's posts |
| Post_interval      | Average interval between posts (in hours) |
| real_fake          | Class label: r (real/authentic user) or f (fake user/bought followers); the sample rows under Q2 show it spelled out, e.g. fake |
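
A few of the definitions in the table can be made concrete with a small sketch. The helper names below are hypothetical, the engagement rate follows the formula exactly as written (the dataset may apply extra scaling, such as a percentage, that the table does not state), and the bag-of-words vectors used for cosine similarity are an assumption, since the table does not say how posts were vectorized.

```python
import math
from collections import Counter
from itertools import combinations


def engagement_rate(total_likes, num_media, num_followers):
    """(likes / media) / followers, as in the table; 0.0 when undefined."""
    if num_media == 0 or num_followers == 0:
        return 0.0
    return total_likes / num_media / num_followers


def caption_zero(captions, max_len=3):
    """Fraction of captions whose length is at most max_len characters."""
    if not captions:
        return 0.0
    return sum(len(c) <= max_len for c in captions) / len(captions)


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters (assumed representation)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mean_pairwise_cosine(captions):
    """Average cosine similarity over every unordered pair of captions."""
    vectors = [Counter(c.lower().split()) for c in captions]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


captions = ["sunset at the beach", "", "ok", "sunset at the lake"]
print(engagement_rate(total_likes=600, num_media=4, num_followers=300))  # 0.5
print(caption_zero(captions))  # 0.5
print(mean_pairwise_cosine(captions))  # 0.125
```

Only the two "sunset" captions share words, so one of the six pairs contributes 0.75 and the average is 0.125.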

# Q1: Import libraries

In [1]:
# write your code here ^_^
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Q2: Read instagram_users.csv file

In [2]:
# write your code here ^_^
df = pd.read_csv('instagram_users.csv')
df.head()

Unnamed: 0,Num_posts,Num_following,Num_followers,Biography_length,Picture_availability,Link_availability,Average_caption_length,Caption_zero,Non_image_percentage,Engagement_rate_like,Engagement_rate_comment,Location_tag_percentage,Average_hashtag_count,Promotional_keywords,Followers_keywords,Cosine_similarity,Post_interval,real_fake
0,44,48,325,33,1,0,12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.094985,fake
1,10,66,321,150,1,0,213,0.0,1.0,14.39,1.97,0.0,1.5,0.0,0.0,0.206826,230.412857,fake
2,33,970,308,101,1,1,436,0.0,1.0,10.1,0.3,0.0,2.5,0.0,0.056,0.572174,43.569939,fake
3,70,86,360,14,1,0,0,1.0,0.0,0.78,0.06,0.0,0.0,0.0,0.0,1.0,5.859799,fake
4,3,21,285,73,1,0,93,0.0,0.0,14.29,0.0,0.667,0.0,0.0,0.0,0.300494,0.126019,fake
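
The `Unnamed: 0` column in the output above looks like a row index that was written out when the CSV was saved. A minimal sketch of how `index_col=0` avoids loading it as a regular column, using a hypothetical two-row CSV string rather than the real file:

```python
import io

import pandas as pd

# Hypothetical CSV whose first header field is empty, as happens when a
# DataFrame is saved together with its index.
csv_text = ",Num_posts,real_fake\n0,44,fake\n1,10,fake\n"

with_extra = pd.read_csv(io.StringIO(csv_text))
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))  # ['Unnamed: 0', 'Num_posts', 'real_fake']
print(list(as_index.columns))    # ['Num_posts', 'real_fake']
```

Whether to drop the column in this notebook is a judgment call: the scores reported below were obtained with it still present.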


# Q3: Split the dataset into training and testing sets

In [3]:
# write your code here ^_^
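from sklearn.model_selection import train_test_split

# Features: every column except the label (this keeps 'Unnamed: 0' as a feature).
x = df.drop('real_fake', axis=1)
# Target: the real/fake class label.
y = df['real_fake']
# Hold out 30% of the rows for testing. No random_state is set, so the split varies between runs.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)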
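
Because the split above sets no `random_state`, the scores reported later will differ from run to run. A minimal sketch of a reproducible, class-balanced split follows. The toy DataFrame is hypothetical and stands in for `instagram_users.csv`, and the `real` label value is an assumption, since only `fake` appears in the sample rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 rows with balanced labels.
toy = pd.DataFrame({
    "Num_posts": range(100),
    "Num_followers": [i * 3 for i in range(100)],
    "real_fake": ["fake"] * 50 + ["real"] * 50,
})

x = toy.drop("real_fake", axis=1)
y = toy["real_fake"]

# stratify=y keeps the class ratio equal in both parts;
# random_state fixes the shuffle so reruns give the same split.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, stratify=y, random_state=42
)

print(len(x_train), len(x_test))        # 70 30
print(y_test.value_counts().to_dict())
```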

# Q4: Build three machine learning models

## Q4.1: The first machine learning model
- Print the model's name.
- Print the model's accuracy.
- Print the model's confusion matrix.

In [4]:
# write your code here ^_^
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Fit a decision tree with default hyperparameters.
dftree = DecisionTreeClassifier()
dftree.fit(x_train, y_train)
print("Model name is:", dftree)

# Evaluate on the held-out test set.
pred = dftree.predict(x_test)
print("Accuracy Score:", accuracy_score(y_test, pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


Model name is: DecisionTreeClassifier()
Accuracy Score: 0.8566462592092975
Confusion Matrix:
 [[8241 1341]
 [1422 8270]]
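
The accuracy printed above can be recovered from the confusion matrix by hand, which is a useful sanity check. The sketch below copies the numbers from that output. scikit-learn puts true labels on rows and predictions on columns, with labels in sorted order, so the first row is assumed here to be the `fake` class.

```python
# Confusion matrix copied from the decision tree output above.
# Rows are true labels, columns are predicted labels.
cm = [[8241, 1341],
      [1422, 8270]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]  # diagonal entries are correct predictions
accuracy = correct / total

# Recall per true class: correct predictions divided by the row total.
recall_first = cm[0][0] / sum(cm[0])
recall_second = cm[1][1] / sum(cm[1])

print(total, correct)        # 19274 16511
print(round(accuracy, 4))    # 0.8566
print(round(recall_first, 4), round(recall_second, 4))
```

The same arithmetic applies to the confusion matrices of the other models.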


## Q4.2: The second machine learning model
- Print the model's name.
- Print the model's accuracy.
- Print the model's confusion matrix.

In [5]:
# write your code here ^_^
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# First attempt with hand-picked hyperparameters.
rfc = RandomForestClassifier(n_estimators=18, criterion='gini', max_depth=5)
rfc.fit(x_train, y_train)
rfc_pred = rfc.predict(x_test)
# Print the fitted model itself; calling fit() again here would retrain a different forest than the one scored below.
print("Model name is:", rfc)
print("Accuracy score:", accuracy_score(y_test, rfc_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, rfc_pred))

param_grid = {
    "n_estimators": [10,20,30], 
    "criterion": ["gini", "entropy"],
    "max_depth": [2,4,6]   
}
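# Exhaustive search over the 18 combinations above.
# best_score_ is the mean cross-validated accuracy on the training data, not test accuracy.
grid = GridSearchCV(rfc, param_grid)
grid.fit(x_train, y_train)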

print('\t####\nThe best parameters are:',grid.best_params_, 'and best accuracy score is:', grid.best_score_)

Model name is: RandomForestClassifier(max_depth=5, n_estimators=18)
Accuracy score: 0.8800975407284425
Confusion matrix:
 [[7473 2109]
 [ 202 9490]]
	####
The best parameters are: {'criterion': 'gini', 'max_depth': 6, 'n_estimators': 30} and best accuracy score is: 0.883544585279075


In [6]:
# x2 = df.drop('real_fake',axis=1)
# y2 = df['real_fake']
# x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size=0.30)

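# Retrain with the n_estimators and max_depth found by the grid search above.
# Note: the search reported criterion='gini' as best, while this cell uses 'entropy'.
rfc = RandomForestClassifier(n_estimators=30, criterion='entropy', max_depth=6)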
rfc.fit(x_train, y_train)
rfc_pred = rfc.predict(x_test)
print("Accuracy score:",accuracy_score(y_test,rfc_pred))
print("Confusion matrix:\n",confusion_matrix(y_test,rfc_pred))

Accuracy score: 0.8845595102210232
Confusion matrix:
 [[7509 2073]
 [ 152 9540]]
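
Retyping the tuned hyperparameters by hand, as in the cell above, makes it easy to drift from what the search actually selected. One alternative is to reuse the refitted model that `GridSearchCV` already holds. This is a minimal sketch on synthetic data from `make_classification`, with a smaller grid than the notebook's. In the notebook itself, `x_train` and `x_test` from Q3 would take the place of the stand-in arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the Instagram dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ has already been refitted on
# all of X_train using best_params_, so nothing needs to be retyped.
best_rfc = grid.best_estimator_
best_pred = best_rfc.predict(X_test)

print(grid.best_params_)
print("Accuracy score:", accuracy_score(y_test, best_pred))
```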


## Q4.3: The third machine learning model
- Print the model's name.
- Print the model's accuracy.
- Print the model's confusion matrix.

In [7]:
from sklearn.linear_model import LogisticRegression

# Logistic regression with default settings (no feature scaling applied).
lr = LogisticRegression()
lr.fit(x_train, y_train)
lr_pred = lr.predict(x_test)
print('Model Name:', lr)
print('Accuracy Score:', accuracy_score(y_test, lr_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, lr_pred))

Model Name: LogisticRegression()
Accuracy Score: 0.7618553491750545
Confusion Matrix:
 [[6653 2929]
 [1661 8031]]
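
Logistic regression is sensitive to feature scale, and this dataset mixes 0-to-1 ratios with raw counts such as follower numbers. Since warnings are suppressed in Q1, a convergence warning would not be visible here. Standardizing the features first may help, though that has not been verified on this dataset. A minimal sketch on hypothetical data, with one small-scale informative feature and one large-scale noise feature:

```python
import random

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the label depends only on the small-scale first feature.
rng = random.Random(0)
X = [[rng.random(), rng.uniform(0, 10_000)] for _ in range(400)]
y = ["real" if row[0] > 0.5 else "fake" for row in X]

X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# The scaler is fitted on the training data only and then applied to the
# test data inside the pipeline, which avoids leaking test statistics.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)

print("Accuracy score:", scaled_lr.score(X_test, y_test))
```

Wrapping both steps in a pipeline also means the same object can be passed to `GridSearchCV` without extra bookkeeping.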
