# Day 15 - classification

In [3]:
import pandas as pd

In [4]:
data = pd.read_csv("data/personality_datasert.csv")

In [5]:
data

Unnamed: 0,Time_spent_Alone,Stage_fear,Social_event_attendance,Going_outside,Drained_after_socializing,Friends_circle_size,Post_frequency,Personality
0,4.0,No,4.0,6.0,No,13.0,5.0,Extrovert
1,9.0,Yes,0.0,0.0,Yes,0.0,3.0,Introvert
2,9.0,Yes,1.0,2.0,Yes,5.0,2.0,Introvert
3,0.0,No,6.0,7.0,No,14.0,8.0,Extrovert
4,3.0,No,9.0,4.0,No,8.0,5.0,Extrovert
...,...,...,...,...,...,...,...,...
2895,3.0,No,7.0,6.0,No,6.0,6.0,Extrovert
2896,3.0,No,8.0,3.0,No,14.0,9.0,Extrovert
2897,4.0,Yes,1.0,1.0,Yes,4.0,0.0,Introvert
2898,11.0,Yes,1.0,3.0,Yes,2.0,0.0,Introvert


## Explanation of features:
- Time_spent_Alone: Hours spent alone daily (0–11).
- Stage_fear: Presence of stage fright (Yes/No).
- Social_event_attendance: Frequency of social events (0–10).
- Going_outside: Frequency of going outside (0–7).
- Drained_after_socializing: Feeling drained after socializing (Yes/No).
- Friends_circle_size: Number of close friends (0–15).
- Post_frequency: Social media post frequency (0–10).
- Personality: Target variable (Extrovert/Introvert).

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2900 entries, 0 to 2899
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Time_spent_Alone           2900 non-null   float64
 1   Stage_fear                 2900 non-null   object 
 2   Social_event_attendance    2900 non-null   float64
 3   Going_outside              2900 non-null   float64
 4   Drained_after_socializing  2900 non-null   object 
 5   Friends_circle_size        2900 non-null   float64
 6   Post_frequency             2900 non-null   float64
 7   Personality                2900 non-null   object 
dtypes: float64(5), object(3)
memory usage: 181.4+ KB


In [8]:
data["Personality"].value_counts()

Personality
Extrovert    1491
Introvert    1409
Name: count, dtype: int64

The two classes are nearly balanced: 1491 extroverts and 1409 introverts, so neither class dominates.
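To make the balance explicit, the counts can be turned into proportions. The sketch below rebuilds the label column from the two counts printed above instead of reading the CSV, so `labels` is only a stand-in for `data["Personality"]`.

```python
import pandas as pd

# Stand-in for data["Personality"], rebuilt from the counts printed above
labels = pd.Series(["Extrovert"] * 1491 + ["Introvert"] * 1409, name="Personality")

# normalize=True returns class proportions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```

On the real data, the same call would be `data["Personality"].value_counts(normalize=True)`.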

In [13]:
import altair as alt

alt.Chart(data).mark_point().encode(
    x="Time_spent_Alone",
    y="Friends_circle_size",
    color="Personality"
)

The two personalities form fairly distinct groups in this plot: points of the same color bunch together, which suggests that Time_spent_Alone and Friends_circle_size are both clearly related to whether a person is an introvert or an extrovert.
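One way to back up the visual impression with numbers is to compare per-class averages of the two plotted features. The sketch below uses only the nine rows visible in the table preview above (a hypothetical stand-in named `sample`), so the values illustrate the approach and are not statistics for the full dataset.

```python
import pandas as pd

# Only the rows visible in the preview above; NOT the full dataset
sample = pd.DataFrame({
    "Time_spent_Alone": [4.0, 9.0, 9.0, 0.0, 3.0, 3.0, 3.0, 4.0, 11.0],
    "Friends_circle_size": [13.0, 0.0, 5.0, 14.0, 8.0, 6.0, 14.0, 4.0, 2.0],
    "Personality": ["Extrovert", "Introvert", "Introvert", "Extrovert", "Extrovert",
                    "Extrovert", "Extrovert", "Introvert", "Introvert"],
})

# Mean of each plotted feature within each class
class_means = sample.groupby("Personality")[
    ["Time_spent_Alone", "Friends_circle_size"]
].mean()
print(class_means)
```

Replacing `sample` with `data` would give the same summary for all 2900 observations.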

In [14]:
from sklearn import set_config

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [15]:
from sklearn.neighbors import KNeighborsClassifier

In [16]:
data_train = data[["Personality", "Friends_circle_size", "Time_spent_Alone"]]

In [17]:
knn = KNeighborsClassifier(n_neighbors=5)
knn

Parameter,Value
n_neighbors,5
weights,'uniform'
algorithm,'auto'
leaf_size,30
p,2
metric,'minkowski'
metric_params,None
n_jobs,None


In [26]:
knn.fit(X=data_train[["Friends_circle_size", "Time_spent_Alone"]], y=data_train["Personality"])

Parameter,Value
n_neighbors,5
weights,'uniform'
algorithm,'auto'
leaf_size,30
p,2
metric,'minkowski'
metric_params,None
n_jobs,None


In [27]:
new_obs = pd.DataFrame({"Friends_circle_size": [0], "Time_spent_Alone": [3.5]})
knn.predict(new_obs)

array(['Introvert'], dtype=object)

In [28]:
new_obs

Unnamed: 0,Friends_circle_size,Time_spent_Alone
0,0,3.5
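To see where a prediction like this comes from, it can help to inspect the neighbors behind it. The sketch below fits a separate classifier on only the nine rows visible in the table preview above (`sample` and `knn_small` are hypothetical names), then uses `kneighbors` and `predict_proba` to show which training points are closest to the new observation and how their labels are split. The `knn` fitted in this notebook uses all 2900 rows, so its neighbors will differ.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Only the rows visible in the preview above; NOT the full dataset
sample = pd.DataFrame({
    "Friends_circle_size": [13.0, 0.0, 5.0, 14.0, 8.0, 6.0, 14.0, 4.0, 2.0],
    "Time_spent_Alone": [4.0, 9.0, 9.0, 0.0, 3.0, 3.0, 3.0, 4.0, 11.0],
    "Personality": ["Extrovert", "Introvert", "Introvert", "Extrovert", "Extrovert",
                    "Extrovert", "Extrovert", "Introvert", "Introvert"],
})
features = ["Friends_circle_size", "Time_spent_Alone"]

knn_small = KNeighborsClassifier(n_neighbors=5)
knn_small.fit(sample[features], sample["Personality"])

new_obs = pd.DataFrame({"Friends_circle_size": [0], "Time_spent_Alone": [3.5]})

# Row positions of the 5 nearest training points and their Euclidean distances
distances, indices = knn_small.kneighbors(new_obs)
neighbor_labels = sample["Personality"].iloc[indices[0]].tolist()
print(neighbor_labels)

# Share of neighbors per class, in the order given by knn_small.classes_
vote_share = dict(zip(knn_small.classes_, knn_small.predict_proba(new_obs)[0]))
print(vote_share)

# With uniform weights, the prediction is the majority label among the neighbors
print(knn_small.predict(new_obs))
```

Both features are used on their raw scales here, as in the notebook, so the feature with the larger range has more influence on the distances.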


In [39]:
data_train = data_train.assign(predicted=knn.predict(data_train[["Friends_circle_size", "Time_spent_Alone"]]))
data_train  # display the frame; later cells rely on the stored "predicted" column

Unnamed: 0,Personality,Friends_circle_size,Time_spent_Alone,predicted
0,Extrovert,13.0,4.0,Extrovert
1,Introvert,0.0,9.0,Introvert
2,Introvert,5.0,9.0,Introvert
3,Extrovert,14.0,0.0,Extrovert
4,Extrovert,8.0,3.0,Extrovert
...,...,...,...,...
2895,Extrovert,6.0,3.0,Extrovert
2896,Extrovert,14.0,3.0,Extrovert
2897,Introvert,4.0,4.0,Introvert
2898,Introvert,2.0,11.0,Introvert


In [46]:
data_train = data_train.assign(correct_prediction=data_train["Personality"] == data_train["predicted"])

In [47]:
data_train["correct_prediction"].value_counts()

correct_prediction
True     2705
False     195
Name: count, dtype: int64

In [58]:
# Error rate = misclassified observations / all observations (195 + 2705 = 2900)
error = 195 / 2900

In [59]:
print(f"Error rate: {error:%}")

Error rate: 6.724138%
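Because this error rate is computed on the same observations the model was fitted on, it is likely to be optimistic. A common remedy is to hold out part of the data for evaluation. The sketch below shows the pattern with `train_test_split` on synthetic data; the generated `synthetic` frame, the 25% test share, and `random_state=0` are all assumptions for illustration, and no result for the real dataset is implied.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in (assumption: this is NOT the personality dataset)
rng = np.random.default_rng(0)
n = 200
is_introvert = rng.random(n) < 0.5
synthetic = pd.DataFrame({
    "Friends_circle_size": np.where(
        is_introvert, rng.integers(0, 6, n), rng.integers(6, 16, n)
    ),
    "Time_spent_Alone": np.where(
        is_introvert, rng.integers(5, 12, n), rng.integers(0, 5, n)
    ),
    "Personality": np.where(is_introvert, "Introvert", "Extrovert"),
})

X = synthetic[["Friends_circle_size", "Time_spent_Alone"]]
y = synthetic["Personality"]

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit on the training part only
knn_holdout = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# score() returns accuracy, so the error rate is its complement
test_error = 1 - knn_holdout.score(X_test, y_test)
print(f"Held-out error rate: {test_error:%}")
```

For the notebook's data, `X` and `y` would instead come from `data[["Friends_circle_size", "Time_spent_Alone"]]` and `data["Personality"]`.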
