I have created a sipler model for Instagram Fake account detection using 2 approaches for Logistic Regression, you can access the code from the below link:

https://www.kaggle.com/code/jasvindernotra/instagram-fake-account-detection

My aim in this particular notebook would be to create a slightly complex model and compare it with the above base model. Please visit the above link to understand the data in general - some of the Exploratory Data Analysis that I have done on the dataset.

In [1]:
#adding the dependencies in the start of the notebook
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
import pickle

Going to upload the data into dataframe df, just going to check the head since the exploratory was done in the previous program that I have run.

In [2]:
df = pd.read_csv('instagram.csv')
df.head()

Unnamed: 0,profile pic,nums/length username,fullname words,nums/length fullname,name==username,description length,external URL,private,#posts,#followers,#follows,fake
0,1,0.27,0,0.0,0,53,0,0,32,1000,955,0
1,1,0.0,2,0.0,0,44,0,0,286,2740,533,0
2,1,0.1,2,0.0,0,0,0,1,13,159,98,0
3,1,0.0,1,0.0,0,82,0,0,679,414,651,0
4,1,0.0,2,0.0,0,0,0,1,6,151,126,0


Going to address the outliers using log transformation, see previous model run for more information.

In [3]:
#addressing the outliers
features_with_outliers = ['#posts', '#followers', '#posts']

for feature in features_with_outliers:
    df[feature] = np.log1p(df[feature])

In [4]:
#adding features in X and the target variable in y
X = df.drop('fake', axis=1)
y = df['fake']

In [5]:
#adding the train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [6]:
#Standardizing the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

Just to check the performance for the models, I will first do a quick run through in a for loop with all features included.

For a more complex classification model we could use Gradient Boosting Classifier, Random Forest Classifier or Support vector machine

In [7]:
#Define the models in a for loop
models = {
    "Gradient Boosting": GradientBoostingClassifier(random_state = 42),
    "Random Forest" : RandomForestClassifier(random_state = 42),
    "Support Vector Machine" : SVC(random_state = 42)
}

In [8]:
# Initiating dictionary to hold the results for each model
results = {
    "Model" : [],
    "Accuracy" : [],
    "Precision" : [],
    "Recall" : [],
    "F1 Score" : []
}

In [9]:
#For each model, running a for loop
for model_name, model in models.items():
    #train the model
    model.fit(X_train, y_train)
    
    #Make predictions using the test set
    y_pred = model.predict(X_test)
    
    #Calculating the performance metrics:
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    #results of the model
    results["Model"].append(model_name)
    results["Accuracy"].append(accuracy)
    results["Precision"].append(precision)
    results["Recall"].append(recall)
    results["F1 Score"].append(f1)
    
#converting the results into dataframe for easier viewing
results_df = pd.DataFrame(results)
print(results_df)

                    Model  Accuracy  Precision    Recall  F1 Score
0       Gradient Boosting  0.922414   0.940000  0.886792  0.912621
1           Random Forest  0.896552   0.886792  0.886792  0.886792
2  Support Vector Machine  0.922414   0.940000  0.886792  0.912621


Based on the results, the metrics of the Gradient boosting model suggests that the model has done a good job of predicting the class labels. Random Forest on the other hand has a slightly lower performance on all the parameters, while Support vector machines seem to have a performance based on all metrics which is very close to the gradient boosting model.

In order to move ahead with the model, I want to understand if either of these models are prone to overfitting or underfitting in any way.

In order to do that, I am going to add another for loop to get the accuracy of both the training and the testing data.

In [10]:
for model_name, model in models.items():
    # Calculate the accuracy on the training set
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    
    # Calculate the accuracy on the test set
    test_accuracy = accuracy_score(y_test, y_pred)
    
    print(f"{model_name}:")
    print(f"Training set accuracy: {train_accuracy}")
    print(f"Test set accuracy: {test_accuracy}\n")

Gradient Boosting:
Training set accuracy: 0.9978260869565218
Test set accuracy: 0.9224137931034483

Random Forest:
Training set accuracy: 0.9978260869565218
Test set accuracy: 0.9224137931034483

Support Vector Machine:
Training set accuracy: 0.95
Test set accuracy: 0.9224137931034483



- Gradient Boosting - huge gap between the accuracies of training and testing data indicating some degree of overfitting. The model may be fitting too closely to the training data and capturing noise in addition to the underlying patterns.
- Random Forest - Nearly perfect training dataset and lower test accuracy indicating overfitting again.
- Support vector Machines - The difference between the training and test set accuracy is smaller than for the other models, suggesting that the SVM model may be less prone to overfitting.

I am more comfortable moving ahead with the SVM model as it seems to be more balanced. Also like we saw in the EDA, the dataset had many non-linear dependencies and SVM seem to work well in such scenarios.

However, just like I did before, I really want to try another model with a few features dropped :) Just to see how it works, for my satisfaction.

In [11]:
X1 = df.drop(['#posts', '#followers', '#follows', 'name==username', 'nums/length fullname', 'fake'], axis=1)

In [12]:
# Split the data into a training set and a test set
X1_train, X1_test, y_train, y_test = train_test_split(X1, y, test_size=0.2, random_state=42)

In [13]:
# Standardize the features
scaler = StandardScaler()
X1_train = scaler.fit_transform(X1_train)
X1_test = scaler.transform(X1_test)

In [14]:
#For each model, running a for loop
for model_name, model in models.items():
    #train the model
    model.fit(X1_train, y_train)
    
    #Make predictions using the test set
    y_pred = model.predict(X1_test)
    
    #Calculating the performance metrics:
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    #results of the model
    results["Model"].append(model_name)
    results["Accuracy"].append(accuracy)
    results["Precision"].append(precision)
    results["Recall"].append(recall)
    results["F1 Score"].append(f1)
    
#converting the results into dataframe for easier viewing
results_df = pd.DataFrame(results)
print(results_df)

                    Model  Accuracy  Precision    Recall  F1 Score
0       Gradient Boosting  0.922414   0.940000  0.886792  0.912621
1           Random Forest  0.896552   0.886792  0.886792  0.886792
2  Support Vector Machine  0.922414   0.940000  0.886792  0.912621
3       Gradient Boosting  0.879310   0.953488  0.773585  0.854167
4           Random Forest  0.887931   0.976190  0.773585  0.863158
5  Support Vector Machine  0.879310   0.953488  0.773585  0.854167


Clearly the new model is less efficient compared to the first model which was running on all features.
For the hyper parameter tuning, I'll use the GridSearchCV. For SVM model, the hyper parameters to tune are C, kernel and gamma.
source: https://scikit-learn.org/stable/modules/grid_search.html

In [15]:
# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

In [16]:
# Create a GridSearchCV object
grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=5, scoring='accuracy')

In [17]:
# Perform the grid search
grid_search.fit(X_train, y_train)

In [18]:
# Get the best parameters
best_params = grid_search.best_params_

print(f"Best parameters: {best_params}")

Best parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}


In [19]:
# Create a new SVM model with the optimal parameters
svm_model_optimized = SVC(C=0.1, gamma='scale', kernel='linear', random_state=42)

In [20]:
# Train the model
svm_model_optimized.fit(X_train, y_train)

In [21]:
# Evaluate the model
y_pred = svm_model_optimized.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
train_accuracy = accuracy_score(y_train, svm_model_optimized.predict(X_train))
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [22]:
print(f"Test Accuracy: {test_accuracy}")
print(f"Train Accuracy: {train_accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Test Accuracy: 0.9310344827586207
Train Accuracy: 0.941304347826087
Precision: 0.9411764705882353
Recall: 0.9056603773584906
F1 Score: 0.9230769230769231


Clearly the optimized support vector machine (SVM model) shows improved performance compared to the previous version.

The test accuracy and train accuracy difference is smaller now, indicating that the overfitting is mitigated to some extent.
Both test accuracy and F1 score has increased compared to the base SVM model. Finally, this concludes that the hyper parameter training has worked in our favour and the improved the model's ability to classify fake instagram profiles.

**Probably, in one of the future tasks I may try to implement neural networks and then gauge the performance.**

**BUT** before that, I want to try one more thing. The Gradient boost allows you to make certain feature selections, I want to try that before concluding this task.

In [23]:
# Create a GradientBoostingClassifier model
gb_model = GradientBoostingClassifier(random_state=42)

In [24]:
# Train the model
gb_model.fit(X_train, y_train)

In [25]:
# Get feature importance
feature_importances = gb_model.feature_importances_

In [26]:
feature_names = df.drop('fake', axis=1).columns

In [27]:
# Create a DataFrame to display the feature importances
features_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
})

In [28]:
# Sort the DataFrame by importance in descending order
features_df = features_df.sort_values(by='Importance', ascending=False)

print(features_df)

                 Feature  Importance
9             #followers    0.630692
1   nums/length username    0.101026
0            profile pic    0.094882
8                 #posts    0.074910
5     description length    0.046931
10              #follows    0.028620
2         fullname words    0.019803
6           external URL    0.001349
7                private    0.001311
4         name==username    0.000281
3   nums/length fullname    0.000195


We can see that the top 5 features itself account for 84% importance in the model. I am interested to see the performance of the model based on just these 5 feature selections.

In [29]:
# Select the top 5 features
top_5_features = features_df.iloc[:5, 0].values

# Get the integer indices of the top 5 features from the original DataFrame
top_5_indices = [df.columns.get_loc(feature) for feature in top_5_features]

# Select only the top 5 features from X_train and X_test
X_train_selected = X_train[:, top_5_indices]
X_test_selected = X_test[:, top_5_indices]

In [30]:
# Create a GradientBoostingClassifier model
gb_model_new = GradientBoostingClassifier(random_state=42)

# Train the model
gb_model_new.fit(X_train_selected, y_train)

# Evaluate the model
y_pred = gb_model_new.predict(X_test_selected)
test_accuracy = accuracy_score(y_test, y_pred)
train_accuracy = accuracy_score(y_train, gb_model_new.predict(X_train_selected))
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Test Accuracy: {test_accuracy}")
print(f"Train Accuracy: {train_accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Test Accuracy: 0.896551724137931
Train Accuracy: 0.9891304347826086
Precision: 0.8867924528301887
Recall: 0.8867924528301887
F1 Score: 0.8867924528301887


The performance of the Gradient Boosting model trained on the top 5 features is slightly lower than the original model trained on all features.

However, the model has become simpler as it's now using fewer features, which can make it easier to interpret and less prone to overfitting. 

But in this case, the training accuracy is significantly higher than the test accuracy (0.989 vs 0.897), suggesting that the model might be overfitting to the training data. 
This could be due to the model complexity of Gradient Boosting and the reduced feature set not providing enough information for the model to generalize well to unseen data.

In terms of selecting a final model, it's a trade-off between model complexity and performance. The SVM model had similar performance to the original Gradient Boosting model but was less prone to overfitting, making it a good choice. The simpler Gradient Boosting model has slightly lower performance but is easier to interpret.

I am interested to use the SVM model, the one that is optimized because it generalizes well and the performance is good too.

Please share your comments on what you think of the model and if I should be optimizing this further at all.

In [31]:
pickle.dump(svm_model_optimized,open('model.pkl','wb'))