<a href="https://colab.research.google.com/github/AndreyPerunov/Personality-Prediction/blob/main/PersonalityPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Personality

## Imports

In [104]:
import kagglehub
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.preprocessing import MinMaxScaler, StandardScaler

### Helper functions

## Portfolio item 1: Exploratory analysis and data preprocessing

Task:

> Choose a dataset for classification problem that is complex enough to illustrate exploratory analysis and data preprocessing techniques. First, perform an overview of the dataset (exploratory analysis) then do the preprocessing step to get the data ready for ML models training.

### Fetching the Dataset

In [15]:
path = kagglehub.dataset_download("rakeshkapilavai/extrovert-vs-introvert-behavior-data")

### Loading the dataset

In [16]:
df = pd.read_csv(path + '/personality_dataset.csv')
print(df.head())
print(df.shape)

   Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
0               4.0         No                      4.0            6.0   
1               9.0        Yes                      0.0            0.0   
2               9.0        Yes                      1.0            2.0   
3               0.0         No                      6.0            7.0   
4               3.0         No                      9.0            4.0   

  Drained_after_socializing  Friends_circle_size  Post_frequency Personality  
0                        No                 13.0             5.0   Extrovert  
1                       Yes                  0.0             3.0   Introvert  
2                       Yes                  5.0             2.0   Introvert  
3                        No                 14.0             8.0   Extrovert  
4                        No                  8.0             5.0   Extrovert  
(2900, 8)


### Checking for Missing Values

In [17]:
print(df.isnull().sum())

Time_spent_Alone             63
Stage_fear                   73
Social_event_attendance      62
Going_outside                66
Drained_after_socializing    52
Friends_circle_size          77
Post_frequency               65
Personality                   0
dtype: int64


### Analysis

About the dataset (quoted from Kaggle):

> This dataset contains 2,900 entries with 8 features related to social behavior and personality traits, designed to explore and classify individuals as Extroverts or Introverts.

 The dataset has **missing values**, so *filling in missing values* is required.

 The dataset has these types of features:

  - `Time_spent_Alone` - Numeric feature
  - `Stage_fear` - Binary feature
  - `Social_event_attendance` - Ordinal feature (scale 0-10)
  - `Going_outside` - Ordinal feature (scale 0-10)
  - `Drained_after_socializing` - Binary feature
  - `Friends_circle_size` - Numeric feature
  - `Post_frequency` - Numeric feature
  - `Personality` - Binary feature (target feature)

### Filling in Missing Values

In [18]:
median_time_spent_alone = df['Time_spent_Alone'].median()
df['Time_spent_Alone'] = df['Time_spent_Alone'].fillna(median_time_spent_alone)

most_frequent_stage_fear = df['Stage_fear'].mode()[0]
df['Stage_fear'] = df['Stage_fear'].fillna(most_frequent_stage_fear)

median_social_event_attendance = df['Social_event_attendance'].median()
df['Social_event_attendance'] = df['Social_event_attendance'].fillna(median_social_event_attendance)

median_going_outside = df['Going_outside'].median()
df['Going_outside'] = df['Going_outside'].fillna(median_going_outside)

most_frequent_drained_after_socializing = df['Drained_after_socializing'].mode()[0]
df['Drained_after_socializing'] = df['Drained_after_socializing'].fillna(most_frequent_drained_after_socializing)

median_friends_circle_size = df['Friends_circle_size'].median()
df['Friends_circle_size'] = df['Friends_circle_size'].fillna(median_friends_circle_size)

median_post_frequency = df['Post_frequency'].median()
df['Post_frequency'] = df['Post_frequency'].fillna(median_post_frequency)

print(df.isnull().sum())

Time_spent_Alone             0
Stage_fear                   0
Social_event_attendance      0
Going_outside                0
Drained_after_socializing    0
Friends_circle_size          0
Post_frequency               0
Personality                  0
dtype: int64
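
The per-column imputation above follows one rule: the median for numeric and ordinal columns, the most frequent value for binary ones. A minimal sketch of the same rule written as two loops, using a made-up toy frame (the values and the `numeric_cols` / `categorical_cols` lists are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of columns as the dataset (values are made up).
toy = pd.DataFrame({
    "Time_spent_Alone": [4.0, np.nan, 9.0, 1.0],
    "Stage_fear": ["No", "Yes", None, "Yes"],
})

numeric_cols = ["Time_spent_Alone"]   # filled with the column median
categorical_cols = ["Stage_fear"]     # filled with the most frequent value

for col in numeric_cols:
    toy[col] = toy[col].fillna(toy[col].median())

for col in categorical_cols:
    # mode() ignores missing values by default and returns a Series
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Grouping the columns this way keeps the rule in one place, which may help if more features are added later.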


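### Converting Features to Numeric Values

The binary features `Stage_fear` and `Drained_after_socializing` are encoded as `1` for `'Yes'` and `0` for `'No'`. For the target `Personality`, `1` represents the `'Introvert'` category and `0` represents `'Extrovert'`.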

In [19]:
df['Stage_fear'] = df['Stage_fear'].map({'Yes': 1, 'No': 0})
df['Drained_after_socializing'] = df['Drained_after_socializing'].map({'Yes': 1, 'No': 0})
df['Personality'] = df['Personality'].map({'Introvert': 1, 'Extrovert': 0})
print(df.head())

   Time_spent_Alone  Stage_fear  Social_event_attendance  Going_outside  \
0               4.0           0                      4.0            6.0   
1               9.0           1                      0.0            0.0   
2               9.0           1                      1.0            2.0   
3               0.0           0                      6.0            7.0   
4               3.0           0                      9.0            4.0   

   Drained_after_socializing  Friends_circle_size  Post_frequency  Personality  
0                          0                 13.0             5.0            0  
1                          1                  0.0             3.0            1  
2                          1                  5.0             2.0            1  
3                          0                 14.0             8.0            0  
4                          0                  8.0             5.0            0  
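
One property of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key in the dictionary becomes `NaN` rather than raising an error. A small sketch on made-up labels (the lowercase `"yes"` is a deliberately misspelled, hypothetical value) shows how such cases could be counted after encoding:

```python
import pandas as pd

labels = pd.Series(["Yes", "No", "yes", "No"])  # note the lowercase "yes"
encoded = labels.map({"Yes": 1, "No": 0})

# Values missing from the mapping become NaN instead of raising an error,
# so counting NaN after the mapping reveals unexpected categories.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist())
print(n_unmapped)
```

Running `df.isnull().sum()` again after the mapping would serve the same purpose on the real data.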


### Scaling The Data

In [20]:
# Min-max normalization for the bounded ordinal features (MinMaxScaler rescales by the observed min and max of each column)
df[["Social_event_attendance", "Going_outside"]] = MinMaxScaler().fit_transform(df[["Social_event_attendance", "Going_outside"]])

# Performing Standardization
df[["Time_spent_Alone", "Friends_circle_size", "Post_frequency"]] = StandardScaler().fit_transform(df[["Time_spent_Alone", "Friends_circle_size", "Post_frequency"]])

print(df.head())

   Time_spent_Alone  Stage_fear  Social_event_attendance  Going_outside  \
0         -0.143788           0                      0.4       0.857143   
1          1.309119           1                      0.0       0.000000   
2          1.309119           1                      0.1       0.285714   
3         -1.306113           0                      0.6       1.000000   
4         -0.434369           0                      0.9       0.571429   

   Drained_after_socializing  Friends_circle_size  Post_frequency  Personality  
0                          0             1.596787        0.500271            0  
1                          1            -1.471766       -0.190744            1  
2                          1            -0.291553       -0.536251            1  
3                          0             1.832829        1.536793            0  
4                          0             0.416574        0.500271            0  
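
The two transformations can be written out by hand, which makes the output above easier to read. The sketch below uses plain NumPy on made-up values rather than the scikit-learn classes; it computes the same formulas that `MinMaxScaler` and `StandardScaler` apply per column with their default settings.

```python
import numpy as np

x = np.array([0.0, 2.0, 6.0, 7.0])  # made-up values on a bounded scale

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, result has mean 0 and unit variance.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard)
```

Because min-max scaling uses the observed extremes, a `Going_outside` value of `6.0` mapping to `0.857143` (that is, 6/7) in the output above suggests that the largest observed value in that column is 7.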


## Portfolio item 2: Logistic regression

Task:

> Train Logistic Regression and estimate its performance. Choose hyperparameters.

### Defining Logistic Regression Class

$$
z = b + \sum_{j=1}^m w_j x_j \tag{Net Input}
$$

$$
\sigma(z)=\frac{1}{1+e^{-z}} \tag{Sigmoid Function}
$$

$$
f(z) = \begin{cases}
1, \quad z \geq 0 \\
0, \quad z < 0
\end{cases}
\tag{Unit Step Function}
$$

$$
J(w)=
-\sum_{i}{
  y^{(i)}\ln{\sigma(z^{(i)})} + (1-y^{(i)}) \ln{\big(1 - \sigma(z^{(i)})\big)}
} \tag{Cost Function}
$$
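
The update rule used for training follows from differentiating the cost function. Using $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$, the partial derivatives simplify as sketched below; the symbol $\eta$ for the learning rate is notation introduced here for illustration.

$$
\frac{\partial J}{\partial w_j} = -\sum_{i} \big(y^{(i)} - \sigma(z^{(i)})\big)\, x_j^{(i)},
\qquad
\frac{\partial J}{\partial b} = -\sum_{i} \big(y^{(i)} - \sigma(z^{(i)})\big)
\tag{Gradient}
$$

$$
w_j := w_j + \eta \sum_{i} \big(y^{(i)} - \sigma(z^{(i)})\big)\, x_j^{(i)},
\qquad
b := b + \eta \sum_{i} \big(y^{(i)} - \sigma(z^{(i)})\big)
\tag{Update Rule}
$$

Each step moves the parameters against the gradient, so the minus sign of the gradient turns into a plus sign in the update.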

In [100]:
class LogisticRegressionGD():
  """Logistic Regression Classifier using gradient descent"""
  def __init__(self, learning_rate=0.1, max_epochs=1000, random_state=1):
    self.weights = None
    self.bias = None

    self.learning_rate = learning_rate
    self.max_epochs = max_epochs

    self.random_state = random_state

  def __net_input(self, x: np.ndarray):
    """Calculate net input"""
    return x.dot(self.weights) + self.bias

  def __activation_function(self, z: np.ndarray):
    """Compute logistic sigmoid activation"""
    return 1. / (1. + np.exp(-np.clip(z, -250, 250)))

  def predict(self, X):
    """Return class label after unit step"""

    if self.weights is None:
      raise RuntimeError("Model has not been trained yet")

    z = self.__net_input(X)
    return np.where(z >= 0.0, 1, 0)

  def fit(self, X: np.ndarray, y: np.ndarray):
    """Fit training data"""
    rgen = np.random.RandomState(self.random_state)

    self.weights = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
    self.bias = rgen.normal(loc=0.0, scale=0.01)
    self.cost = []

    for epoch in range(1, self.max_epochs + 1):
      net_input = self.__net_input(X)
      output = np.clip(self.__activation_function(net_input), 1e-15, 1 - 1e-15)
      errors = (y - output)

      self.weights += self.learning_rate * X.T.dot(errors)
      self.bias += self.learning_rate * errors.sum()

      cost = (-y.dot(np.log(output)) - ((1 - y).dot(np.log(1 - output))))
      self.cost.append(cost)

    return self
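
To see the training loop in isolation, the sketch below repeats the same computations (sigmoid, cost, and one gradient step) as standalone functions on a tiny made-up dataset. The function names and the data are illustrative assumptions, not part of the class above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid with clipping to avoid overflow in exp."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

def cost(w, b, X, y):
    """Cross-entropy cost summed over all samples."""
    p = np.clip(sigmoid(X.dot(w) + b), 1e-15, 1 - 1e-15)
    return -y.dot(np.log(p)) - (1 - y).dot(np.log(1 - p))

def gradient_step(w, b, X, y, learning_rate):
    """One full-batch gradient descent update of weights and bias."""
    errors = y - sigmoid(X.dot(w) + b)
    return w + learning_rate * X.T.dot(errors), b + learning_rate * errors.sum()

# Tiny made-up, linearly separable dataset
X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, -1.0], [3.0, -0.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b = np.zeros(2), 0.0
cost_before = cost(w, b, X, y)
for _ in range(500):
    w, b = gradient_step(w, b, X, y, learning_rate=0.1)
cost_after = cost(w, b, X, y)

# Unit step on the net input, as in predict()
predictions = np.where(X.dot(w) + b >= 0.0, 1, 0)
print(cost_before, cost_after, predictions)
```

Note that the update sums the errors over all samples instead of averaging them, so the effective step size grows with the number of rows.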

### Executing with all the features

In [121]:
observations = df.iloc[:, :-1].values
true_target_values = df.iloc[:, -1].values


lrgd = LogisticRegressionGD()
results = lrgd.fit(observations, true_target_values)
accuracy = np.mean(lrgd.predict(observations) == true_target_values)

print(f"Accuracy: {accuracy:.2%} matches")
print(f"Weights: {results.weights}")
print(f"Bias: {results.bias}")

Accuracy: 92.45% matches
Weights: [ 53.80510663 130.80378999  25.11393121  39.81371554 124.28029353
 -35.24515634 -54.13040806]
Bias: -14.912967634522296
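
The accuracy above is measured on the same rows that were used for training, so it may overstate how well the model generalizes. One common way to estimate performance on unseen data is a hold-out split. The sketch below is a minimal version with NumPy only; `holdout_indices` is a hypothetical helper, and the 20% test fraction and the seed are arbitrary assumptions.

```python
import numpy as np

def holdout_indices(n_samples, test_fraction=0.2, seed=1):
    """Return shuffled (train_idx, test_idx) index arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # The first n_test shuffled indices form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = holdout_indices(2900)
print(len(train_idx), len(test_idx))
```

With these indices, the model could be fitted on `observations[train_idx]` and `true_target_values[train_idx]`, and the accuracy computed on the rows selected by `test_idx`.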
