An intelligence quotient (IQ) is a total score derived from a set of standardized tests or subtests designed to assess human intelligence. The abbreviation "IQ" was coined by the psychologist William Stern for the German term Intelligenzquotient, his term for a scoring method for intelligence tests that he advocated in a 1912 book while at the University of Breslau.

Historically, IQ was a score obtained by dividing a person's mental age score, obtained by administering an intelligence test, by the person's chronological age, both expressed in terms of years and months. The resulting fraction (quotient) was multiplied by 100 to obtain the IQ score. For modern IQ tests, the median raw score of the norming sample is defined as IQ 100, and each standard deviation (SD) above or below the median is defined as 15 IQ points higher or lower. By this definition, approximately two-thirds of the population scores between IQ 85 and IQ 115. About 2.5 percent of the population scores above 130, and 2.5 percent below 70. https://en.wikipedia.org/wiki/Intelligence_quotient
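
The two scoring definitions above can be written as short formulas. The following is a minimal sketch, not taken from any test manual: the function names `ratio_iq` and `deviation_iq` and the sample numbers are illustrative assumptions, and the population shares are checked under the assumption that scores follow a normal distribution with mean 100 and SD 15.

```python
from statistics import NormalDist


def ratio_iq(mental_age_months, chronological_age_months):
    """Historical ratio IQ: mental age divided by chronological age, times 100."""
    return 100 * mental_age_months / chronological_age_months


def deviation_iq(raw_score, norm_center, norm_sd):
    """Modern deviation IQ: 100 at the center of the norming sample, 15 points per SD."""
    z = (raw_score - norm_center) / norm_sd
    return 100 + 15 * z


# Assumption: IQ scores are normally distributed with mean 100 and SD 15.
iq_dist = NormalDist(mu=100, sigma=15)
share_85_115 = iq_dist.cdf(115) - iq_dist.cdf(85)   # share within one SD
share_above_130 = 1 - iq_dist.cdf(130)              # share more than two SDs above

print(ratio_iq(120, 96))          # a 10-year-old mental age at 8 years of age
print(deviation_iq(60, 50, 10))   # one SD above the norming center
print(round(share_85_115, 3), round(share_above_130, 3))
```

Under the normality assumption the share within one SD comes out a little above two-thirds, and the share above 130 a little below 2.5 percent, which is consistent with the rounded figures quoted in the text.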

In [None]:
#codes from Rodrigo Lima  @rodrigolima82
from IPython.display import Image
Image(url = 'https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRmr9DLU2itRpL5VaXZFXiBSVgQDuXFDRqHDsyePcJNIiBea8jR&usqp=CAU',width=400,height=400)

iq-test.net

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
import plotly.graph_objs as go
import plotly.offline as py
import plotly.express as px
from plotly.offline import iplot
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/alphaversion-fullscale-iq-test-responses/data.csv', encoding='ISO-8859-2')
df.head()

#Codes from Will Koehrsen https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction

In [None]:
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

In [None]:
df_missing= missing_values_table(df)
df_missing

Let's look at the number of columns of each data type. int64 and float64 columns are numeric variables (which can be either discrete or continuous). object columns contain strings and are treated as categorical features.

In [None]:
# Number of each type of column
df.dtypes.value_counts()

Let's now look at the number of unique entries in each of the object (categorical) columns.

In [None]:
# Number of unique classes in each object column
df.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

Most of the categorical variables have a relatively large number of unique entries. We will need to find a way to deal with these categorical variables!

Most machine learning models cannot work with categorical variables directly (some models, such as LightGBM, are exceptions), so the categories have to be encoded as numbers first. There are two main ways to do this:
Label encoding: assign an integer to each unique category in a categorical variable. No new columns are created.
One-hot encoding: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.

The only downside to one-hot encoding is that the number of features (dimensions of the data) can explode with categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or other dimensionality reduction methods to reduce the number of dimensions (while still trying to preserve information).
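
To make the difference between the two encodings concrete, here is a small sketch on a toy dataframe. The column names `handedness` and `country` and their values are made up for illustration and are not columns of the IQ dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not taken from the IQ test responses
toy = pd.DataFrame({
    "handedness": ["right", "left", "right", "right"],
    "country": ["US", "GB", "IN", "US"],
})

# Label encoding: one integer per category, the column count stays the same
label_encoded = LabelEncoder().fit_transform(toy["handedness"])

# One-hot encoding: one new column per category
one_hot = pd.get_dummies(toy[["country"]])

print(list(label_encoded))
print(list(one_hot.columns))
print(one_hot.shape)
```

A single three-category column already becomes three columns here, which is the growth in dimensionality that the paragraph above warns about.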

#Label Encoding and One-Hot Encoding

Let's implement the policy described above: for any categorical variable (dtype == object) with 2 unique categories, we will use label encoding, and for any categorical variable with more than 2 unique categories, we will use one-hot encoding.

For label encoding, we use the Scikit-Learn LabelEncoder and for one-hot encoding, the pandas get_dummies(df) function.

In [None]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in df:
    if df[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(df[col].unique())) <= 2:
            # Fit the encoder on this column
            le.fit(df[col])
            # Replace the column with its integer codes
            df[col] = le.transform(df[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

In [None]:
# one-hot encoding of categorical variables
df = pd.get_dummies(df)
#app_test = pd.get_dummies(app_test)

print('Features shape after one-hot encoding: ', df.shape)
#print('Testing Features shape: ', app_test.shape)

In [None]:
ext_data = df[['VQ1s', 'testelapse', 'introelapse', 'endelapse', 'MQ6e']]
ext_data_corrs = ext_data.corr()
ext_data_corrs

In [None]:
# Copy the data for plotting
plot_data = ext_data.drop(columns = ['testelapse']).copy()

# Re-assign introelapse from the full dataframe (it is already present in ext_data, so this changes nothing)
plot_data['introelapse'] = df['introelapse']

# Drop na values and limit to first 100000 rows
plot_data = plot_data.dropna().loc[:100000, :]

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 20)

# Create the pairgrid object
grid = sns.PairGrid(data = plot_data, height = 3, diag_sharey=False,
                    hue = 'VQ1s', 
                    vars = [x for x in list(plot_data.columns) if x != 'VQ1s'])

# Upper is a scatter plot
grid.map_upper(plt.scatter, alpha = 0.2)

# Diagonal is a kernel density estimate
grid.map_diag(sns.kdeplot)

# Bottom is density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);

plt.suptitle('VQ1s & introelapse Features Pairs Plot', size = 32, y = 1.05);

#Codes from Parul Pandey  https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python

In [None]:
import missingno as msno
#msno.bar(df)

In [None]:
#msno.matrix(df)

In [None]:
#msno.heatmap(df)

In [None]:
#msno.dendrogram(df)

In [None]:
df.isnull().sum()

In [None]:
#df_1 = df.copy()
#df_1['VQ2a'].mean() #pandas skips the missing values and calculates mean of the remaining values.

Basic Imputation Techniques
Imputing with a constant value
Imputing with a statistic (mean, median or most frequent value) of the column in which the missing values are located
For this we shall use the SimpleImputer class from sklearn.

In [None]:
# imputing with a constant

from sklearn.impute import SimpleImputer
df_constant = df.copy()
# setting strategy to 'constant' fills missing values with fill_value (0 by default for numeric data)
constant_imputer = SimpleImputer(strategy='constant')
df_constant.iloc[:,:] = constant_imputer.fit_transform(df_constant)
df_constant.isnull().sum()

In [None]:
from sklearn.impute import SimpleImputer
df_most_frequent = df.copy()
# setting strategy to 'most_frequent' to impute with the most frequent value of each column
most_frequent_imputer = SimpleImputer(strategy='most_frequent')  # strategy can also be 'mean' or 'median'
df_most_frequent.iloc[:,:] = most_frequent_imputer.fit_transform(df_most_frequent)

In [None]:
df_most_frequent.isnull().sum()
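
To see how the strategies differ, the sketch below runs each of them on a single toy column with one missing value. The column name `score` and its values are made up for illustration and do not come from the IQ dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing entry (row index 3)
toy = pd.DataFrame({"score": [1.0, 2.0, 2.0, np.nan, 10.0]})

filled = {}
for strategy in ["mean", "median", "most_frequent", "constant"]:
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that was written into the missing cell
    filled[strategy] = imputer.fit_transform(toy)[3, 0]

print(filled)
```

The mean is pulled upward by the outlier 10.0, while the median and the most frequent value are not. The 'constant' strategy ignores the data and uses `fill_value`, which defaults to 0 for numeric input.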

#K-Nearest Neighbor Imputation

The KNNImputer class provides imputation for filling in missing values using the k-Nearest Neighbors approach. Each missing feature is imputed using values from the n_neighbors nearest neighbors that have a value for the feature. The features of the neighbors are averaged uniformly or weighted by the distance to each neighbor.

In [None]:
from sklearn.impute import KNNImputer
df_knn = df.copy(deep=True)

knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
df_knn['RQ6a'] = knn_imputer.fit_transform(df_knn[['RQ6a']])

In [None]:
df_knn['RQ6a'].isnull().sum()
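
KNNImputer finds neighbors using the other columns of each row, so its effect is easier to see on an array with more than one feature. The following is a minimal sketch on a made-up two-column array, not on the IQ dataset.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the second feature is missing in the third row
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)

# The two rows closest to 3.0 in the first feature are 2.0 and 1.0,
# so the missing value is the plain average of 20.0 and 10.0.
print(X_filled[2, 1])
```

With weights="distance" the closer neighbor (the row with 2.0) would count for more than the farther one.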

#Multivariate feature imputation - Multivariate imputation by chained equations (MICE)

A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion. It performs multiple regressions over random samples of the data, then takes the average of the multiple regression values and uses that value to impute the missing value. In sklearn, it is implemented as follows:

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df_mice = df.copy(deep=True)

mice_imputer = IterativeImputer()
df_mice['RQ6a'] = mice_imputer.fit_transform(df_mice[['RQ6a']])

In [None]:
df_mice['RQ6a'].isnull().sum()
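
Because IterativeImputer models each feature as a function of the others, it needs more than one column to do any regression. The sketch below uses a made-up two-column array in which the second feature is exactly twice the first. The `max_iter` and `random_state` values are arbitrary choices for illustration, and a single fit like this returns one imputation rather than an average over several.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical toy data: second column = 2 * first column, one value missing
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

# The regression on the first column should recover a value close to 8
print(X_filled[3, 1])
```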

#IQ tests have wielded a great deal of power on society over the last 120 years. 

#In the 1900s, eugenicists used the test to judge people for sterilization.

#More recently, IQ has helped inmates avoid capital punishment and kids get the right education.

#Scientists still debate the merit of IQ, however.

In [None]:
#codes from Rodrigo Lima  @rodrigolima82
from IPython.display import Image
Image(url = 'https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSC45MgUFkcXqMBKdsOL0oSYBbbBHWfRub5K1R54SXHeOSumQ2O&usqp=CAU',width=400,height=400)

free-iqtest.net

Kaggle Notebook Runner: Marília Prata  @mpwolke