Task:

 We conducted a survey to study the psychological traits of Polish internet users. The results consist of two files:

1. users.csv - demographic data of the respondents and the browser they use.

2. personality.csv - psychological profile of the respondents, described by 5 traits: A-E.

 
Description of the demographic features:

·       D01 Sex

·       D02 Year of birth

·       D03 Education - primary, vocational, secondary, higher

·       D04 Professional status

·       D05 Size of place of residence - village, up to 20k, up to 100k, up to 500k, above

·       D06 Financial situation

·       D07 Household size

 
We are looking for answers to the following questions:

1. Is there a relationship between the information we hold about the respondents and their psychological profile?

2. Can we divide the respondents into groups of people with a similar psychological profile? What are these groups, what distinguishes each of them, and what is their demographic profile?

 
Carry out an appropriate data analysis. Prepare a short, high-level summary for management and a code package that allows the most important results to be reproduced and the solutions to be developed further.

## Exploratory Data Analysis

In [1]:
%reload_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import eda

In [None]:
users_df = pd.read_csv('data/users.csv')
personality_df = pd.read_csv('data/personality.csv')

In [None]:
users_df.head()

In [None]:
personality_df.head()

### first let's get rid of duplicated entries

In [None]:
print(f"there were {users_df.shape[0] - users_df.drop_duplicates().shape[0]} duplicated user entries" )
print(f"there were {personality_df.shape[0] - personality_df.drop_duplicates().shape[0]} duplicated personality entries" )

users_df = users_df.drop_duplicates()
personality_df = personality_df.drop_duplicates()

# Data cleaning

### 1. Split 'UserBrowser' into many columns: 'Browser', 'Version', 'Device'

In [None]:
# split "Browser Version (Device)" on spaces into three columns
users_df[['Browser', 'Version', 'Device']] = users_df['UserBrowser'].str.split(" ", expand=True)
# remove the parentheses around the device name
users_df['Device'] = users_df['Device'].str.strip('()')
users_df.drop('UserBrowser', axis=1, inplace=True)
users_df.head()

Now we have to convert the new columns into numerical ones.

In [None]:
eda.categ_summary(users_df['Browser'])

Most respondents use Chrome.

In [None]:
eda.categ_summary(users_df['Device'])

Tablets are very rare.

In [None]:
eda.categ_summary(users_df['Version'])

There are too many distinct versions for the column to be informative, so we drop it.

In [None]:
users_df = users_df.drop(['Version'], axis=1)

Finally, we convert the categorical values into numerical ones using one-hot encoding.

In [None]:
users_df = pd.get_dummies(users_df, columns=['Browser', 'Device'])

### 2. Encode sex as binary

In [None]:
# "M" = mężczyzna (male) -> 0, "K" = kobieta (female) -> 1
users_df['Sex'] = users_df['D01'].map({"M":0, "K": 1})

### 3. Change dtypes from 'float64' to 'int16'
(Columns D05-D07 have 'nan' values, hence they are left as float)

In [None]:
users_df.dtypes

In [None]:
users_df = users_df.astype(dtype={'D02':np.int16, 'D03':np.int16, 'D04':np.int16})
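
Columns that contain missing values do not have to stay as floats: pandas also offers nullable integer dtypes. A minimal sketch on made-up data (the frame `demo` and its values are assumptions for illustration, not the survey data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for columns such as D05-D07, which contain NaN
demo = pd.DataFrame({"D05": [1.0, 3.0, np.nan], "D06": [2.0, np.nan, 4.0]})

# "Int16" (capital I) is pandas' nullable integer dtype; NaN becomes <NA>
demo_int = demo.astype("Int16")
print(demo_int.dtypes)
print(demo_int)
```

This only works when the non-missing floats are whole numbers, which is the case for coded survey answers.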

#### check for those nan values

In [None]:
col_nan = users_df.isna().sum()
# the values shown are counts of missing entries per column, not percentages
print("Column | NaN count")
col_nan[col_nan > 0]

In [None]:
users_df[users_df['D05'].isna()]

Only one record has missing information, so for simplicity we discard it. If the data were scarce, the missing values could instead be imputed, e.g. with the column average, where applicable.

In [None]:
users_df = users_df.dropna()
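
If dropping rows were too costly, imputation could look roughly like the sketch below. The frame `toy` and its values are made up for illustration; the median is used here as one possible choice for ordinal codes, not as the method applied in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: ordinal survey answers with a few gaps (values are made up)
toy = pd.DataFrame({
    "City size": [1.0, 2.0, 2.0, np.nan, 5.0, 3.0],
    "Financial_situation": [3.0, np.nan, 4.0, 4.0, 2.0, 3.0],
})

# The median is less sensitive to extreme codes than the mean;
# the mode (toy.mode().iloc[0]) is another common option for ordinal data
imputed = toy.fillna(toy.median())
print(imputed)
```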

### 4. Assuming that this data comes from the Polish market, we can divide the year of birth column into generations
source : https://natemat.pl/235903,do-jakiego-pokolenia-naleze-generacja-z-to-najliczniejsza-grupa-w-polsce

![docs/pokolenia.jpg](docs/pokolenia.jpg)

In [None]:
users_df['D02'].hist()

In [None]:
year_of_birth_mapper = {"pokolenie Z": range(1995, 2020),
                       "pokolenie Y": range(1980, 1995),
                       "pokolenie X": range(1964, 1980),
                       "pokolenie BB": range(1946, 1964),
                       "other": range(users_df['D02'].min(), 1946)}

In [None]:
users_df['Generation'] = users_df['D02'].apply(lambda x: next((k for k, v in year_of_birth_mapper.items() if x in v), 0))

In [None]:
users_df['Generation'].hist()

The named generations were only for display; for the analysis we need to convert them into numerical form.

In [None]:
year_of_birth_mapper_to_numerical = {"pokolenie Z": 5,
                                     "pokolenie Y": 4,
                                     "pokolenie X": 3,
                                     "pokolenie BB": 2,
                                     "other": 1}

In [None]:
# look up each generation name directly; unmatched values fall back to 0
users_df['Generation'] = users_df['Generation'].map(year_of_birth_mapper_to_numerical).fillna(0).astype(int)
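
The two-step mapping from birth year to generation code can also be sketched with `pd.cut`. The birth years below are made up, and the bin edges simply mirror `year_of_birth_mapper`; this is an alternative formulation under those assumptions, not the code used in the notebook.

```python
import pandas as pd

# Made-up birth years; the bin edges mirror year_of_birth_mapper
years = pd.Series([1940, 1950, 1970, 1985, 2000], name="D02")

# right=False makes each bin [left, right), matching Python's range() semantics
edges = [-float("inf"), 1946, 1964, 1980, 1995, 2020]
names = ["other", "pokolenie BB", "pokolenie X", "pokolenie Y", "pokolenie Z"]
generation = pd.cut(years, bins=edges, labels=names, right=False)

# Categorical codes start at 0 (and are -1 for unmatched years),
# so adding 1 gives the 1..5 encoding with 0 as the fallback
generation_code = generation.cat.codes + 1
print(pd.DataFrame({"year": years, "generation": generation, "code": generation_code}))
```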

In [None]:
users_df = users_df.rename(columns = {"D03": "Education", 
                                      "D05": "City size", 
                                      "D04": "Professional status", 
                                     "D06": "Financial_situation",
                                     "D07": "Size of Household"})

In [None]:
users_df = users_df.drop(["D01", "D02"], axis=1)

#### We assume that the higher the number, the better the financial situation. The histogram below is roughly bell-shaped, which suggests (but does not prove) that the wealth distribution in the sample is fairly representative of the population.

In [None]:
users_df['Financial_situation'].hist()

In [None]:
users_df['Size of Household'].hist()

In [None]:
users_df['Professional status'].hist()

In [None]:
users_df.head()

#### end of column preprocessing

# pre-statistical analysis: 
### let's see if we have any duplicates in the form of the same user but with different variable values

In [None]:
print(f"there are {users_df['UserIdentifier'].nunique()} unique identifiers in the users csv")
print(f"there are {personality_df['UserIdentifier'].nunique()} unique identifiers in the personality csv")

# identifiers that appear more than once in users_df, most frequent first
user_counts = users_df['UserIdentifier'].value_counts()
user_counts = user_counts[user_counts > 1].rename_axis('id').reset_index(name='users')

# identifiers that appear more than once in personality_df, most frequent first
personality_counts = personality_df['UserIdentifier'].value_counts()
personality_counts = personality_counts[personality_counts > 1].rename_axis('id').reset_index(name='personality')

In [None]:
user_counts

In [None]:
personality_counts

#### let's consider only the users that are present in both tables, since we cannot evaluate anything useful for this task from information in only one table

In [None]:
user_counts.merge(personality_counts, on='id')

In [None]:
user_1 = '77f0be1043bff8c9a56eade3b14ae1d3'
user_2 = '8015c0d8fc1e5cacfc646805a107a774'

So two user ids have more than one entry in both tables. Let's explore why that is the case.

In [None]:
users_df[users_df['UserIdentifier']==user_1]

In [None]:
users_df[users_df['UserIdentifier']==user_2]

### We can see that this is because their financial situation has changed. Let's see if this had an impact on their personality

In [None]:
personality_df[personality_df['UserIdentifier']==user_1]

In [None]:
personality_df[personality_df['UserIdentifier']==user_2]

## We can see that their psychological profiles may differ slightly, but since there are only two such anomalies, we drop them from further analysis

In [None]:
users_df = users_df[~users_df['UserIdentifier'].isin([user_1, user_2])]
personality_df = personality_df[~personality_df['UserIdentifier'].isin([user_1, user_2])]

#### personality nan values

In [None]:
# rows of personality_df with at least one missing value
nan_per = personality_df[personality_df.isna().any(axis=1)]
print(nan_per.shape)
nan_per.head()

Hence we can fill these values with the column mean

In [None]:
trait_cols = ['A', 'B', 'C', 'D', 'E']
# fill each trait's missing values with that trait's column mean
personality_df[trait_cols] = personality_df[trait_cols].fillna(personality_df[trait_cols].mean())

In [None]:
personality_df.shape, personality_df.dropna().shape

In [None]:
users_df.shape, users_df.dropna().shape

## Now we can proceed to join the two dataframes


In [None]:
df = personality_df.merge(users_df, on='UserIdentifier')
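
Before relying on the join, it can help to check that identifiers are unique on both sides and to see how many rows fail to match. The sketch below uses two made-up frames (`left` and `right` are hypothetical stand-ins for `personality_df` and `users_df`) together with the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy frames standing in for personality_df and users_df (values are made up)
left = pd.DataFrame({"UserIdentifier": ["a", "b", "c"], "A": [0.1, 0.2, 0.3]})
right = pd.DataFrame({"UserIdentifier": ["b", "c", "d"], "Sex": [0, 1, 1]})

# validate= raises MergeError if an identifier repeats on either side;
# indicator= adds a _merge column saying which table each row came from
audit = left.merge(right, on="UserIdentifier", how="outer",
                   validate="one_to_one", indicator=True)
print(audit["_merge"].value_counts())

# Rows marked "both" are the ones an inner join would keep
both = audit[audit["_merge"] == "both"].drop(columns="_merge")
print(both)
```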

## et voila, the final dataframe

In [None]:
df.head()

---------

# Psychological data analysis

### pair plot for personality data

### check for group clusters

#### using t-sne

In [None]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
import matplotlib
matplotlib.style.use('ggplot')
import scipy

In [None]:
tmp_df = personality_df.drop(['UserIdentifier'],axis=1)

In [None]:
tmp_df = tmp_df.astype(np.float32)
tmp_df = tmp_df.dropna()

In [None]:
# computing t-SNE
time_start = time.time()
tsne = TSNE(n_components=2, verbose=3, perplexity=20, n_iter=1000,learning_rate=250)
tsne_results = tsne.fit_transform(tmp_df)
print ("t-SNE done! Time elapsed: {} seconds".format(time.time()-time_start))

In [None]:
plt.scatter(tsne_results[:,0], tsne_results[:,1])

# pair plot for correlation check

In [None]:
X = df[['A', 'B', 'C', 'D', 'E']].astype(np.float32)
Y = df.drop(['UserIdentifier','A', 'B', 'C', 'D', 'E'], axis=1).astype(np.float32)

In [None]:
sns.pairplot(X)

In [None]:
def corr_heatmap(df):
    sns.set(style="white")

    # Compute the correlation matrix
    corr = df.corr()

    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})

In [None]:
corr_heatmap(X)

In [None]:
corr_heatmap(Y)

In [None]:
corr_heatmap(df.drop(['UserIdentifier'], axis=1))

#### Since we need to find the correlation between two sets of variables, we can't simply use something like multiple regression, which handles only a single response variable.

Instead we can use canonical correlation analysis (CCA), a multivariate method for comparing sets of continuous or categorical variables to each other. It can be used (instead of multiple regression, for example) when you suspect that the variables in the sets you're looking at are significantly correlated. Canonical correlation accounts for multicollinearity or covariance among the variables.
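
To make the idea concrete, here is a minimal NumPy sketch of how canonical correlations can be computed, using the standard QR-plus-SVD formulation. It is an illustration on synthetic data, not the scikit-learn implementation used below; the function name `canonical_correlations` and the arrays `X_demo` and `Y_demo` are hypothetical, and the sketch assumes both blocks have full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Return the canonical correlations between the columns of X and Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centred blocks
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles,
    # i.e. the canonical correlations, sorted from largest to smallest
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)  # shared signal linking both blocks
X_demo = np.column_stack([latent + rng.normal(size=n), rng.normal(size=n)])
Y_demo = np.column_stack([latent + rng.normal(size=n),
                          rng.normal(size=n), rng.normal(size=n)])

rho = canonical_correlations(X_demo, Y_demo)
print(rho)
```

Only one pair of columns shares a signal here, so the first canonical correlation should stand out while the second stays close to zero.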

#### 1. Is there a relationship between the information we hold about the respondents and their psychological profile?

# ...

Hence, we are asking whether there is a relationship between the user information (their connected devices and personal status) and the users' psychological profiles.

In [None]:
from sklearn.cross_decomposition import CCA
cca = CCA(n_components=1, scale=True, max_iter=3000)

In [None]:
cca.fit(X, Y)

In [None]:
X_c, Y_c = cca.transform(X, Y)

In [None]:
plt.scatter(X_c, Y_c)
# the correlation is computed on the same data the model was fitted on
plt.title('Comp. 1: X vs Y (corr = %.2f)' %
          np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])

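##### Hence we can see that there is a correlation between the two multivariate datasets.

To check whether the first canonical variate is approximately normally distributed, we can perform the Shapiro-Wilk test. This tests normality rather than the significance of the correlation itself.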

In [None]:
from scipy import stats

In [None]:
shapiro_test = stats.shapiro(X_c[:, 0])
# index 0 is the test statistic, index 1 is the p-value
print(f"statistic {shapiro_test[0]}\t p-value {shapiro_test[1]}")

In [None]:
cca.score(X, Y)
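
One assumption-light way to attach a p-value to the first canonical correlation is a permutation test: shuffle the rows of one block, recompute the statistic, and see how often the shuffled value reaches the observed one. The sketch below is a NumPy illustration on synthetic data; the helper names `first_canonical_corr` and `permutation_pvalue` and the demo arrays are hypothetical, and this is not something the notebook itself runs.

```python
import numpy as np

def first_canonical_corr(X, Y):
    """Largest canonical correlation, computed via QR and SVD."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return float(np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0])

def permutation_pvalue(X, Y, n_perm=500, seed=0):
    """Share of row-shuffled datasets whose statistic reaches the observed one."""
    rng = np.random.default_rng(seed)
    observed = first_canonical_corr(X, Y)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the rows of Y breaks any link between the two blocks
        shuffled = Y[rng.permutation(len(Y))]
        hits += first_canonical_corr(X, shuffled) >= observed
    # The +1 terms keep the estimate away from an impossible p-value of 0
    return observed, (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
shared = rng.normal(size=(300, 1))
X_demo = np.hstack([shared + rng.normal(size=(300, 1)), rng.normal(size=(300, 1))])
Y_demo = np.hstack([shared + rng.normal(size=(300, 1)), rng.normal(size=(300, 2))])

r_obs, p_val = permutation_pvalue(X_demo, Y_demo)
print(f"first canonical correlation {r_obs:.2f}, permutation p-value {p_val:.3f}")
```

With real data, `X_demo` and `Y_demo` would be replaced by the NumPy values of the two variable blocks.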

### redundancy analysis

Once the model is fitted, we get the structural coefficients (loadings). They show the influence of each variable on the cross-variate relationship.

In [None]:
x_load = pd.DataFrame(cca.x_loadings_).T
x_load.columns = list(X.columns)
x_load.T

In [None]:
y_load = pd.DataFrame(cca.y_loadings_).T
y_load.columns = list(Y.columns)
y_load.T

2. Can we divide the respondents into groups of people with a similar psychological profile? What are these groups, what distinguishes each of them, and what is their demographic profile?

#### From the loadings above we can deduce that the generation to which a user belongs appears to have the biggest influence on their psychological profile, and their Professional status clearly does not.

#### The groups that differ in psychological profile are separated by the generation they belong to, in other words by the range of years in which they were born. Age therefore appears to have a notable influence on mentality.
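
A simple way to inspect such a claim is to compare the mean trait scores per generation. The sketch below uses a made-up frame `toy` with two hypothetical trait columns; with the real data, the merged frame and the five trait columns would take its place.

```python
import pandas as pd

# Toy stand-in for the merged frame (all values are made up)
toy = pd.DataFrame({
    "Generation": [2, 2, 3, 3, 4, 4, 5, 5],
    "A": [0.9, 1.1, 0.4, 0.6, -0.1, 0.1, -0.6, -0.4],
    "B": [0.0, 0.2, 0.1, -0.1, 0.0, 0.1, -0.2, 0.1],
})

traits = ["A", "B"]
# Mean trait score per generation, plus the group size behind each mean
profile = toy.groupby("Generation")[traits].mean()
profile["n"] = toy.groupby("Generation").size()
print(profile)

# Spread of the group means relative to the overall spread of each trait:
# the larger the ratio, the more the generations differ on that trait
spread = profile[traits].std() / toy[traits].std()
print(spread)
```

In this toy example trait `A` shifts steadily across generations while `B` does not, which is the kind of pattern the statement above would predict.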

In [None]:
extracted = df[['A', 'B', 'C', 'D', 'E']]
y = df['Generation']

In [None]:
labels = df['Generation'].apply(lambda x: next((k for k, v in year_of_birth_mapper_to_numerical.items() if x==v), 0)).values

In [None]:

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

In [None]:
np.unique(labels)

In [None]:
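# define the number of generation classes before displaying it
num_classes = df['Generation'].nunique()
num_classes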

In [None]:
use_PCA=True

In [None]:
if use_PCA:
    # reduce the five traits to four principal components before t-SNE
    pca = PCA(n_components=4)
    extracted = pca.fit_transform(extracted)
    print('Cumulative explained variation for 4 principal components: {}'.format(np.sum(pca.explained_variance_ratio_)))

# computing t-SNE
time_start = time.time()
tsne = TSNE(n_components=2, verbose=3, perplexity=10, n_iter=500,learning_rate=200)
tsne_results = tsne.fit_transform(extracted)
print ("t-SNE done! Time elapsed: {} seconds".format(time.time()-time_start))

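# plotting part
classes = np.unique(y)
num_classes = len(classes)
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
colors = cm.Spectral(np.linspace(0, 1, num_classes))
# map the numeric generation codes back to their names for the legend
code_to_name = {v: k for k, v in year_of_birth_mapper_to_numerical.items()}

xx = tsne_results[:, 0]
yy = tsne_results[:, 1]

# generation codes do not start at 0, so iterate over the codes actually present
for color, cls in zip(colors, classes):
    mask = (y == cls).values
    ax.scatter(xx[mask], yy[mask], color=color, label=code_to_name.get(cls, str(cls)), s=30)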

plt.title("t-SNE dimensions colored by class")
plt.axis('tight')
plt.legend(loc='best', scatterpoints=1, fontsize=10,prop={'size': 12})
# plt.savefig("presentation_images/t-sne"+type_+".png")
plt.xlabel('$x$ t-SNE')
plt.ylabel('$y$ t-SNE')
plt.show()