In [1]:
import pandas as pd

In [2]:
dataset=pd.read_csv("Preprocessed Tamil Nadu ChatGPT Data.csv")

In [3]:
dataset

Unnamed: 0,Age,Gender,District,Occupation,Usage Frequency,Main Purpose,Language Used,User Rating,Education Level,Actions Taken,Device Used,Time Spent Hours
0,25,1,0,10,3.0,3,1,5,2.0,1,3,0.5
1,28,1,8,4,3.0,4,1,5,3.0,1,1,3.5
2,48,0,0,3,3.0,1,2,5,0.0,1,3,0.5
3,54,1,0,1,3.0,0,2,5,0.0,1,3,0.5
4,27,0,12,6,3.0,4,0,5,2.0,1,3,0.5
...,...,...,...,...,...,...,...,...,...,...,...,...
105,17,0,3,10,2.0,3,1,4,0.0,0,3,0.5
106,18,1,17,0,1.0,3,1,3,1.0,1,3,0.5
107,28,0,8,2,0.0,5,1,5,2.0,0,3,0.5
108,28,0,12,3,0.0,3,1,2,2.0,0,3,0.5


In [4]:
dataset.columns

Index(['Age', 'Gender', 'District', 'Occupation', 'Usage Frequency',
       'Main Purpose', 'Language Used', 'User Rating', 'Education Level',
       'Actions Taken', 'Device Used', 'Time Spent Hours'],
      dtype='object')

In [5]:
#input & output split
#assigning dependent & independent variable
independent = dataset[['Age', 'Gender', 'District', 'Occupation', 'Usage Frequency', 'Language Used', 'User Rating', 'Education Level','Actions Taken', 'Device Used', 'Time Spent Hours']] #assigning single column to separate variable
dependent = dataset[["Main Purpose"]]

In [6]:
#training & test set split
from sklearn.model_selection import train_test_split #importing a function from library
x_train,x_test,y_train,y_test=train_test_split(independent,dependent,test_size=0.30,random_state=0) 

In [7]:
from sklearn.feature_selection import SelectKBest, chi2, RFE
# Chi-Square Feature Selection
chi_selector = SelectKBest(score_func=chi2, k='all')
chi_selector.fit(independent, dependent)

chi_scores = pd.DataFrame({
    'Feature': independent.columns,
    'Chi2 Score': chi_selector.scores_
}).sort_values(by='Chi2 Score', ascending=False)

print("🔹 Chi-Square Feature Importance:\n")
print(chi_scores)

🔹 Chi-Square Feature Importance:

             Feature  Chi2 Score
3         Occupation   83.665922
0                Age   30.285949
2           District   17.234750
1             Gender   16.949841
10  Time Spent Hours   16.508250
4    Usage Frequency   15.123232
7    Education Level    9.316434
9        Device Used    6.925534
5      Language Used    4.714534
8      Actions Taken    1.808112
6        User Rating    1.050149


Highest score — this means users’ occupations strongly influence their purpose of using ChatGPT.
Based on Chi-Square feature importance, Occupation, Age, and Usage Frequency are the top three predictors influencing why people in Tamil Nadu use ChatGPT.

In [8]:
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings("ignore", category=DataConversionWarning)

In [9]:
from sklearn.linear_model import LogisticRegression
# Logistic Regression for RFE
log_reg = LogisticRegression(max_iter=1000)

# Perform RFE
rfe_selector = RFE(log_reg, n_features_to_select=7)
rfe_selector = rfe_selector.fit(independent, dependent)

rfe_results = pd.DataFrame({
    'Feature': independent.columns,
    'Selected': rfe_selector.support_,
    'Ranking': rfe_selector.ranking_
}).sort_values(by='Ranking')

print("\n🔹 RFE Selected Features:\n")
print(rfe_results)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



🔹 RFE Selected Features:

             Feature  Selected  Ranking
1             Gender      True        1
3         Occupation      True        1
4    Usage Frequency      True        1
5      Language Used      True        1
7    Education Level      True        1
8      Actions Taken      True        1
10  Time Spent Hours      True        1
9        Device Used     False        2
6        User Rating     False        3
0                Age     False        4
2           District     False        5


Selected (Important) Features:
'Gender', 'Occupation', 'Usage Frequency', 'Language Used', 'Education Level', 'Actions Taken', 'Time Spent Hours'

Removed (Less Important) Features:
'Device Used', 'User Rating', 'Age', 'District'

These seven selected features will give more efficient and accurate predictive model — avoiding overfitting and redundancy.

In [10]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame()
vif_data["Feature"] = independent.columns
vif_data["VIF"] = [variance_inflation_factor(independent.values, i) for i in range(independent.shape[1])]
print("\n🔹 Variance Inflation Factor (VIF):\n")
print(vif_data)


🔹 Variance Inflation Factor (VIF):

             Feature        VIF
0                Age  14.119846
1             Gender   1.832868
2           District   5.507664
3         Occupation   8.105220
4    Usage Frequency   7.450477
5      Language Used   3.699195
6        User Rating  22.686634
7    Education Level   4.087886
8      Actions Taken   5.317036
9        Device Used  13.532916
10  Time Spent Hours   3.836566


VIF < 5 → Low multicollinearity → Keep
5 ≤ VIF < 10 → Moderate multicollinearity → Watch carefully
VIF ≥ 10 → Serious multicollinearity → Consider removing or combining variables
User Rating & Age - strong multicollinearity

In [11]:
independent_reduced = dataset.drop(columns=['User Rating', 'Device Used', 'Age'])

In [12]:
vif_data = pd.DataFrame()
vif_data['Feature'] = independent_reduced.columns
vif_data['VIF'] = [variance_inflation_factor(independent_reduced.values, i) for i in range(independent_reduced.shape[1])]
print(vif_data)

            Feature       VIF
0            Gender  1.753290
1          District  4.751287
2        Occupation  6.783415
3   Usage Frequency  5.613485
4      Main Purpose  4.163443
5     Language Used  2.761497
6   Education Level  3.834680
7     Actions Taken  5.259265
8  Time Spent Hours  3.271013


all values drop below 10, data is now ready for modeling

<span style="color: red; font-size: 20px;">Model Creation</span>

In [13]:
independent = dataset[['Gender', 'District', 'Occupation', 'Usage Frequency', 'Language Used','Education Level','Actions Taken', 'Time Spent Hours']] #assigning single column to separate variable
dependent = dataset[["Main Purpose"]]

In [14]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
    "SVM": SVC(),
}

results = []

for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    rec = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    results.append({
        "Model": name,
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1 Score": f1
    })
results_df = pd.DataFrame(results).sort_values(by='Accuracy', ascending=False)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [16]:
results_df

Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score
2,Random Forest,0.818182,0.824793,0.818182,0.816498
5,SVM,0.69697,0.685426,0.69697,0.687172
0,Logistic Regression,0.606061,0.647567,0.606061,0.618302
1,Decision Tree,0.606061,0.759019,0.606061,0.642018
3,KNN,0.575758,0.533701,0.575758,0.546305
4,GaussianNB,0.363636,0.777345,0.363636,0.344036


In [17]:

from sklearn.model_selection import GridSearchCV
models_params = {
    "Logistic Regression": (LogisticRegression(max_iter=500),{"C": [0.1, 1, 10]}),
    
    "Decision Tree": (DecisionTreeClassifier(random_state=42),{"max_depth": [3, 5, 7, None],"criterion": ["gini", "entropy", "log_loss"],"max_features": ["sqrt", "log2", None]}),
    
    "Random Forest": (RandomForestClassifier(random_state=42),{"n_estimators": [50, 100, 200],"max_depth": [None, 5, 10],"min_samples_split": [2, 5, 10], "criterion": ["gini", "entropy", "log_loss"],"max_features": ["sqrt", "log2", None]}),
    
    "KNN": (KNeighborsClassifier(),{"n_neighbors": [3, 5, 7, 9],"weights": ["uniform", "distance"]}),
    
    "SVM": (SVC(),{"C": [0.1, 1, 10],"kernel": ["linear", "rbf", "poly"]})
}

results = []

for name, (model, params) in models_params.items():
    grid = GridSearchCV(model, params, cv=5, scoring='f1', n_jobs=-1)
    grid.fit(x_train, y_train)

    best_model = grid.best_estimator_
    y_pred = best_model.predict(x_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

    results.append({
        "Model": name,
        "Best Params": grid.best_params_,
        "Accuracy": round(accuracy, 4),
        "Precision": round(precision, 4),
        "Recall": round(recall, 4),
        "F1 Score": round(f1, 4)
    })

results_df = pd.DataFrame(results)


 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan]


In [18]:
results_df

Unnamed: 0,Model,Best Params,Accuracy,Precision,Recall,F1 Score
0,Logistic Regression,{'C': 0.1},0.6667,0.5738,0.6667,0.6034
1,Decision Tree,"{'criterion': 'gini', 'max_depth': 3, 'max_fea...",0.4848,0.5816,0.4848,0.5019
2,Random Forest,"{'criterion': 'gini', 'max_depth': None, 'max_...",0.7576,0.8283,0.7576,0.785
3,KNN,"{'n_neighbors': 3, 'weights': 'uniform'}",0.3939,0.501,0.3939,0.4191
4,SVM,"{'C': 0.1, 'kernel': 'linear'}",0.5758,0.5131,0.5758,0.5392


Random Forest: Achieved the highest scores across the board with an accuracy of 0.7576, precision of 0.8283, recall of 0.7576, and F1 score of 0.7850.

Logistic Regression: Performed reasonably well, but was surpassed by Random Forest.

SVM: Showed moderate performance, but was not as effective as Logistic Regression or Random Forest.

Decision Tree: Was one of the lower-performing models.

KNN: Was the lowest-performing model with the lowest scores for all metrics.
    