| Column name | Description |
| ----- | ------- |
| Num_posts | Total number of posts the user has ever made |
| Num_following | Number of accounts the user follows |
| Num_followers | Number of followers |
| Biography_length | Length (number of characters) of the user's biography |
| Picture_availability | 0 if the user has no profile picture, 1 if they do |
| Link_availability | 0 if the user has no external URL, 1 if they do |
| Average_caption_length | Average number of characters in media captions |
| Caption_zero | Fraction (0.0 to 1.0) of captions with almost zero length (<=3 characters) |
| Non_image_percentage | Fraction (0.0 to 1.0) of non-image media. An Instagram post has one of three media types: image, video, or carousel |
| Engagement_rate_like | Engagement rate (ER) for likes, commonly defined as (number of likes) / (number of media) / (number of followers) |
| Engagement_rate_comment | Same as Engagement_rate_like, but for comments |
| Location_tag_percentage | Fraction (0.0 to 1.0) of posts tagged with a location |
| Average_hashtag_count | Average number of hashtags used per post |
| Promotional_keywords | Average use of promotional keywords in hashtags, i.e. regrann, contest, repost, giveaway, mention, share, give away, quiz |
| Followers_keywords | Average use of follower-hunting keywords in hashtags, i.e. follow, like, folback, follback, f4f |
| Cosine_similarity | Average cosine similarity between all pairs of the user's posts |
| Post_interval | Average interval between posts (in hours) |
| real_fake | Class label: `real` (authentic user) or `fake` (fake user with bought followers) |
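
As a rough illustration of the engagement-rate definition in the table, the sketch below computes it from raw counts. The function name `engagement_rate` and the sample numbers are hypothetical, the zero-denominator handling is an assumption, and the values stored in the dataset may be scaled differently (for example as percentages).

```python
def engagement_rate(total_interactions: int, num_media: int, num_followers: int) -> float:
    """ER = interactions / media / followers, following the table's definition.

    Returns 0.0 when either denominator is zero. This is an assumption: the
    dataset does not document how it handles that case.
    """
    if num_media == 0 or num_followers == 0:
        return 0.0
    return total_interactions / num_media / num_followers


# Hypothetical account: 10 posts, 500 followers, 1200 likes and 60 comments in total.
er_like = engagement_rate(1200, 10, 500)
er_comment = engagement_rate(60, 10, 500)
print(er_like, er_comment)
```

The same function covers both Engagement_rate_like and Engagement_rate_comment, since only the interaction count changes.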

# Q1: Import libraries

In [1]:
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
import warnings

# Suppress library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

# Q2: Read the instagram_users.csv file

In [2]:
df = pd.read_csv('instagram_users.csv')
df.head()

Unnamed: 0,Num_posts,Num_following,Num_followers,Biography_length,Picture_availability,Link_availability,Average_caption_length,Caption_zero,Non_image_percentage,Engagement_rate_like,Engagement_rate_comment,Location_tag_percentage,Average_hashtag_count,Promotional_keywords,Followers_keywords,Cosine_similarity,Post_interval,real_fake
0,44,48,325,33,1,0,12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.094985,fake
1,10,66,321,150,1,0,213,0.0,1.0,14.39,1.97,0.0,1.5,0.0,0.0,0.206826,230.412857,fake
2,33,970,308,101,1,1,436,0.0,1.0,10.1,0.3,0.0,2.5,0.0,0.056,0.572174,43.569939,fake
3,70,86,360,14,1,0,0,1.0,0.0,0.78,0.06,0.0,0.0,0.0,0.0,1.0,5.859799,fake
4,3,21,285,73,1,0,93,0.0,0.0,14.29,0.0,0.667,0.0,0.0,0.0,0.300494,0.126019,fake


In [3]:
df.shape

(64244, 18)

# Q3: Split the dataset into training and testing sets

In [4]:
# Features: every column except the label. Target: the real/fake label.
X = df.drop('real_fake', axis=1).to_numpy()
y = df['real_fake'].to_numpy()

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=23)

In [6]:
print(X_train.shape, X_test.shape)

(51395, 17) (12849, 17)
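
The shapes above follow from the 80/20 split. The arithmetic can be checked by hand, assuming scikit-learn rounds a fractional `test_size` up to a whole number of samples and gives the remainder to the training set. The 17 feature columns are the 18 columns of `df` minus the `real_fake` label.

```python
import math

n_samples = 64244   # rows reported by df.shape
n_columns = 18      # columns reported by df.shape
test_size = 0.20    # fraction passed to train_test_split

# Assumption: the test count is rounded up and the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
n_features = n_columns - 1  # the label column is dropped from X

print((n_train, n_features), (n_test, n_features))
```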


# Q4: Build three machine learning models

## Q4.1: The first machine learning model
- Print the model's name.
- Print the model's accuracy.
- Print the model's confusion matrix.

In [7]:
from sklearn.tree import DecisionTreeClassifier
# Default decision tree. No random_state is set, so results may vary slightly between runs.
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

In [8]:
y_pred = dtc.predict(X_test)

In [9]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

        fake       0.85      0.85      0.85      6414
        real       0.85      0.85      0.85      6435

    accuracy                           0.85     12849
   macro avg       0.85      0.85      0.85     12849
weighted avg       0.85      0.85      0.85     12849



In [10]:
print(confusion_matrix(y_test, y_pred))

[[5474  940]
 [ 990 5445]]
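
The task list for each model asks for the model's name, accuracy, and confusion matrix, while the cells above print a classification report and the matrix. One possible way to print all three items explicitly is sketched below. The helper name `report` and the synthetic data from `make_classification` are assumptions made so the example runs without `instagram_users.csv`; they are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), with 17 features like the real dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=17, random_state=23)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=23
)


def report(model, X_eval, y_eval):
    """Print and return the model's class name, accuracy, and confusion matrix."""
    y_hat = model.predict(X_eval)
    model_name = type(model).__name__
    acc = accuracy_score(y_eval, y_hat)
    cm = confusion_matrix(y_eval, y_hat)
    print(f"Model: {model_name}")
    print(f"Accuracy: {acc:.4f}")
    print(cm)
    return model_name, acc, cm


demo_model = DecisionTreeClassifier(random_state=23).fit(Xd_train, yd_train)
demo_name, demo_acc, demo_cm = report(demo_model, Xd_test, yd_test)
```

In the notebook itself, the same helper could be called as `report(dtc, X_test, y_test)` after fitting.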


## Q4.2: The second machine learning model
- Print the model's name.
- Print the model's accuracy.
- Print the model's confusion matrix.

In [11]:
from sklearn.ensemble import RandomForestClassifier
# Random forest with 120 trees. No random_state is set, so results may vary slightly between runs.
rfc = RandomForestClassifier(n_estimators=120)
rfc.fit(X_train, y_train)

In [12]:
y_pred = rfc.predict(X_test)

In [13]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

        fake       0.95      0.83      0.89      6414
        real       0.85      0.96      0.90      6435

    accuracy                           0.90     12849
   macro avg       0.90      0.90      0.90     12849
weighted avg       0.90      0.90      0.90     12849



In [14]:
print(confusion_matrix(y_test, y_pred))

[[5351 1063]
 [ 274 6161]]
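
The precision and recall figures in the report can be recovered from this confusion matrix by hand. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`fake`, then `real`); the row sums match the support column of the report.

```python
# Confusion matrix printed above for the random forest.
cm = [[5351, 1063],
      [274, 6161]]
labels = ["fake", "real"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = cm[0][i] + cm[1][i]  # column sum: everything predicted as this class
    actual = sum(cm[i])              # row sum: everything truly in this class
    metrics[label] = (true_pos / predicted, true_pos / actual)
    precision, recall = metrics[label]
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")

print(f"accuracy={accuracy:.2f}")
```

Most of the errors are in the top-right cell: 1063 fake accounts were predicted as real, which is why recall for `fake` is lower than recall for `real`.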


## Q4.3: The third machine learning model
- Print the model's name.
- Print the model's accuracy.
- Print the model's confusion matrix.

In [15]:
# Depth-limited decision tree with a fixed seed for reproducibility.
dtc2 = DecisionTreeClassifier(criterion='gini', max_depth=10,
                              random_state=23, splitter='best', min_samples_split=4,
                              min_samples_leaf=2)
dtc2.fit(X_train, y_train)

In [16]:
y_pred = dtc2.predict(X_test)

In [17]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

        fake       0.94      0.82      0.87      6414
        real       0.84      0.95      0.89      6435

    accuracy                           0.88     12849
   macro avg       0.89      0.88      0.88     12849
weighted avg       0.89      0.88      0.88     12849



In [18]:
print(confusion_matrix(y_test, y_pred))

[[5246 1168]
 [ 353 6082]]
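
To compare the three models side by side, accuracy can be recomputed from the confusion matrices printed in this section. This is only a sketch: the dictionary keys are descriptive labels chosen here, and the first two models were fitted without a fixed `random_state`, so a rerun may give slightly different numbers.

```python
# Confusion matrices copied from the three outputs in this section.
results = {
    "DecisionTreeClassifier (default)": [[5474, 940], [990, 5445]],
    "RandomForestClassifier (n_estimators=120)": [[5351, 1063], [274, 6161]],
    "DecisionTreeClassifier (max_depth=10)": [[5246, 1168], [353, 6082]],
}


def accuracy_from_cm(cm):
    """Share of correct predictions: diagonal sum divided by the total count."""
    correct = cm[0][0] + cm[1][1]
    return correct / sum(sum(row) for row in cm)


scores = {name: accuracy_from_cm(cm) for name, cm in results.items()}
for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f}")

best = max(scores, key=scores.get)
print(f"Highest accuracy: {best}")
```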
