<h1>Data Wrangling</h1>
<h5>Author: Kevin Mntambo</h5>
<p><b>Description:</b> This project examines data from applicants who took a personality test. From their answers we score each applicant on five personality subscales and flag possible high-risk individuals.</p>

In [1]:
import pandas as pd
import numpy as np
personality_scores_df = pd.read_csv("../data/personality_scores.csv",sep=";")
personality_scores_df.head()

Unnamed: 0,ID,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],Section 5 of 6 [I am quick to understand things.],...,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,IPIP_HIGH_RISK
0,0,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)","(5, 5)",...,,,,,,,,,,
1,1,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)","(5, 5)",...,,,,,,,,,,
2,2,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,,,,,,,,,,
3,3,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)","(5, 3)",...,,,,,,,,,,
4,4,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)","(5, 5)",...,,,,,,,,,,


The preview shows missing values: the 'Unnamed' columns and IPIP_HIGH_RISK at the end of the table are empty in every row displayed.
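
Before dropping anything, it can help to measure how much of each column is missing rather than rely on the first five rows. The sketch below is a minimal illustration on a made-up frame: `toy_df` and its contents are assumptions, not the real survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw file: two populated columns, two empty ones.
toy_df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Section 5 of 6 [I am always prepared.]": ["(3, 5)", "(3, 5)", "(3, 3)"],
    "Unnamed: 60": [np.nan, np.nan, np.nan],
    "IPIP_HIGH_RISK": [np.nan, np.nan, np.nan],
})

# Fraction of missing values per column (1.0 means the column is entirely empty).
missing_share = toy_df.isna().mean()

# Names of the columns that hold no data at all.
empty_columns = missing_share[missing_share == 1.0].index.tolist()
print(empty_columns)
```

Running the same two lines on `personality_scores_df` would list exactly the columns that the next cell removes.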
    


In [2]:
personality_scores_df = personality_scores_df.dropna(how = "all", axis="columns")
personality_scores_df.head()

Unnamed: 0,ID,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],Section 5 of 6 [I am quick to understand things.],...,Section 5 of 6 [I often forget to put things back in their proper place],Section 5 of 6 [I pay attention to details.],Section 5 of 6 [I seldom feel blue (down).],Section 5 of 6 [I spend time reflecting on things.],Section 5 of 6 [I start conversations.],Section 5 of 6 [I sympathize with others' feelings.],Section 5 of 6 [I take time out for others.],Section 5 of 6 [I talk to a lot of different people at parties.],Section 5 of 6 [I use difficult words.],Section 5 of 6 [I worry about things.]
0,0,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 3)","(5, 5)","(1, 3)","(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
1,1,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 1)","(4, 1)","(5, 5)","(1, 5)","(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)"
2,2,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 1)","(5, 3)","(1, 3)","(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
3,3,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)","(5, 3)",...,"(3, 1)","(3, 5)","(4, 1)","(5, 5)","(1, 5)","(2, 5)","(2, 5)","(1, 5)","(5, 1)","(4, 1)"
4,4,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 5)","(5, 5)","(1, 3)","(2, 3)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"


We dropped the columns that contain no values at all, which leaves a cleaner table to work with.
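
The choice of `how="all"` matters here. As a small sketch on invented data (the frame and its column names are illustrative only), compare it with the default `how="any"`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "partial" has one gap, "empty" has no values at all.
toy_df = pd.DataFrame({
    "ID": [0, 1, 2],
    "partial": ["(3, 5)", np.nan, "(3, 3)"],
    "empty": [np.nan, np.nan, np.nan],
})

# how="all" removes only columns in which every value is missing.
kept_all = toy_df.dropna(how="all", axis="columns").columns.tolist()

# how="any" (the default) also removes columns with a single missing value.
kept_any = toy_df.dropna(how="any", axis="columns").columns.tolist()

print(kept_all)  # ['ID', 'partial']
print(kept_any)  # ['ID']
```

Using `how="all"` keeps questionnaire columns that merely have a few gaps, so no answered item is discarded by accident.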

In [3]:
num_unique_entries = len(personality_scores_df['ID'].unique())

assert len(personality_scores_df['ID']) == num_unique_entries


print(num_unique_entries)

1555



There are no repeated entries: all 1555 rows have unique IDs.
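
If the assertion had failed, we would want to know which IDs repeat. A hedged sketch of that follow-up, using an invented ID column (`toy_ids` is not part of the data set):

```python
import pandas as pd

# Hypothetical ID column with one repeated value, for illustration only.
toy_ids = pd.Series([0, 1, 2, 2, 3], name="ID")

# is_unique is True only when no value occurs more than once.
print(toy_ids.is_unique)

# keep=False marks every occurrence of a repeated ID, not just the later ones.
repeated = toy_ids[toy_ids.duplicated(keep=False)]
print(repeated.tolist())
```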

In [4]:
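def traits_sum(row, n):
    # Each cell is a string such as "(3, 5)": ch[1] is the subscale digit, ch[4] the score digit.
    # Character indexing assumes both numbers are single digits.
    total = 0
    for ch in row:
        if ch[1] == n:
            total += int(ch[4])
    return total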

personality_traits_df= personality_scores_df.loc[:,'Section 5 of 6 [I am always prepared.]':'Section 5 of 6 [I worry about things.]']

personality_scores_df['Extraversion'] = personality_traits_df.apply(traits_sum,n='1',axis=1)
personality_scores_df['Agreableness'] = personality_traits_df.apply(traits_sum,n='2',axis=1)
personality_scores_df['Conscientiousness'] = personality_traits_df.apply(traits_sum,n='3',axis=1)
personality_scores_df['Neuroticism'] = personality_traits_df.apply(traits_sum,n='4',axis=1)
personality_scores_df['Imagination'] = personality_traits_df.apply(traits_sum,n='5',axis=1)

personality_scores_df.head()

Unnamed: 0,ID,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],Section 5 of 6 [I am quick to understand things.],...,Section 5 of 6 [I sympathize with others' feelings.],Section 5 of 6 [I take time out for others.],Section 5 of 6 [I talk to a lot of different people at parties.],Section 5 of 6 [I use difficult words.],Section 5 of 6 [I worry about things.],Extraversion,Agreableness,Conscientiousness,Neuroticism,Imagination
0,0,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)","(5, 5)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",30,40,48,36,42
1,1,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)","(5, 5)",...,"(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)",42,46,46,40,42
2,2,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",28,40,40,38,42
3,3,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)","(5, 3)",...,"(2, 5)","(2, 5)","(1, 5)","(5, 1)","(4, 1)",30,38,38,40,38
4,4,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)","(5, 5)",...,"(2, 3)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",28,34,46,38,36


We have summed each individual's answers into the five personality subscales (Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Imagination). These totals will help us quickly identify 'high-risk' individuals, who may not be well suited to the work environment. Note that the code spells the Agreeableness column 'Agreableness'.
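
Each answer is stored as text such as "(3, 5)", which the scoring reads as (subscale, score), and `traits_sum` picks out the two digits by character position. A possible alternative, assuming every cell is a valid Python tuple literal, is to parse the text with `ast.literal_eval`. `subscale_totals` and the toy frame below are hypothetical and not part of the original notebook.

```python
import ast
import pandas as pd

def subscale_totals(row):
    """Sum the scores in one row of "(subscale, score)" strings, per subscale."""
    totals = {}
    for cell in row:
        subscale, score = ast.literal_eval(cell)  # "(3, 5)" -> (3, 5)
        totals[subscale] = totals.get(subscale, 0) + score
    return pd.Series(totals)

# Hypothetical answers for two applicants and three items.
toy_df = pd.DataFrame({
    "item_a": ["(3, 5)", "(3, 3)"],
    "item_b": ["(4, 5)", "(4, 1)"],
    "item_c": ["(3, 1)", "(3, 5)"],
})

# One column per subscale number, one row per applicant.
totals_df = toy_df.apply(subscale_totals, axis=1)
print(totals_df)
```

This computes all subscales in a single pass and would keep working if a score ever had more than one digit, at the cost of being slower than plain string indexing.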

In [5]:
department_df = pd.read_csv("../data/departments.csv",sep=';')
department_df["Department"].unique()


array(['Data', 'Web Dev', 'Copywriting', 'Design', 'Strategy', 'Web dev'],
      dtype=object)

One department name is inconsistent: 'Web Dev' also appears as 'Web dev', differing only in letter case.
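
With only six labels the duplicate is easy to spot by eye. For longer lists, a check along the following lines could find case variants automatically; the labels in `toy_departments` are a made-up sample.

```python
import pandas as pd

# Hypothetical department labels, including one case variant.
toy_departments = pd.Series(["Data", "Web Dev", "Web dev", "Design", "Web Dev"])

# Count how many distinct spellings collapse onto each lowercased name.
spellings_per_name = toy_departments.groupby(toy_departments.str.lower()).nunique()

# Names with more than one spelling are the inconsistent ones.
inconsistent = spellings_per_name[spellings_per_name > 1].index.tolist()
print(inconsistent)
```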


In [6]:
department_df["Department"] = department_df["Department"].str.lower()
personality_scores_df = pd.merge(department_df,personality_scores_df,on='ID')
personality_scores_df.head()


Unnamed: 0,ID,Department,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],...,Section 5 of 6 [I sympathize with others' feelings.],Section 5 of 6 [I take time out for others.],Section 5 of 6 [I talk to a lot of different people at parties.],Section 5 of 6 [I use difficult words.],Section 5 of 6 [I worry about things.],Extraversion,Agreableness,Conscientiousness,Neuroticism,Imagination
0,0,data,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",30,40,48,36,42
1,1,data,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)",...,"(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)",42,46,46,40,42
2,2,data,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",28,40,40,38,42
3,3,data,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)",...,"(2, 5)","(2, 5)","(1, 5)","(5, 1)","(4, 1)",30,38,38,40,38
4,4,data,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)",...,"(2, 3)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",28,34,46,38,36


We merged each individual's department into the scores table to add context for each individual, after lowercasing the department names so that every department has a single spelling.
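
Because `pd.merge` defaults to an inner join, an ID missing from either table disappears without warning, and a repeated ID multiplies rows. The sketch below shows one way to guard against both, using invented frames (`toy_departments` and `toy_scores` are assumptions):

```python
import pandas as pd

# Hypothetical inputs; "ID" is the shared key, as in the notebook.
toy_departments = pd.DataFrame({"ID": [0, 1, 2], "Department": ["data", "design", "web dev"]})
toy_scores = pd.DataFrame({"ID": [0, 1, 2], "Extraversion": [30, 42, 28]})

# validate="one_to_one" raises pandas.errors.MergeError if an ID repeats on either side.
merged = pd.merge(toy_departments, toy_scores, on="ID", validate="one_to_one")

# An inner merge drops IDs missing from either table,
# so comparing row counts is a cheap sanity check.
print(len(merged) == len(toy_scores))
```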

In [7]:
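# One row per applicant (IDs were verified unique above).
high_risk_df = personality_scores_df.drop_duplicates(subset='ID')
# High risk: below 30 on Conscientiousness, Agreeableness, and Neuroticism.
risk_mask = (high_risk_df['Conscientiousness'] < 30) & (high_risk_df['Agreableness'] < 30) & (high_risk_df['Neuroticism'] < 30)
high_risk_df = high_risk_df.loc[risk_mask].copy()
high_risk_df['risk'] = "high risk"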

high_risk_df[['ID','Department']]

Unnamed: 0,ID,Department
881,881,data
1197,1197,copywriting



There are two high-risk individuals in this data set. We define high risk as scoring below 30 on each of Agreeableness, Conscientiousness, and Neuroticism; such people may not be well suited to the work environment.
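
The three-way condition can also be written so that the subscales and the cutoff are named once. This is a sketch on invented totals, not a replacement for the cell above; the column name 'Agreableness' follows the notebook's spelling.

```python
import pandas as pd

# Hypothetical subscale totals for three applicants.
toy_scores = pd.DataFrame({
    "ID": [0, 1, 2],
    "Conscientiousness": [48, 26, 26],
    "Agreableness": [40, 28, 34],
    "Neuroticism": [36, 28, 22],
})

risk_columns = ["Conscientiousness", "Agreableness", "Neuroticism"]
threshold = 30  # cutoff taken from the notebook

# True only where every listed subscale is strictly below the threshold.
is_high_risk = toy_scores[risk_columns].lt(threshold).all(axis=1)
print(toy_scores.loc[is_high_risk, "ID"].tolist())  # [1]
```

Applicant 2 is not flagged because one of the three totals (34) is not below the cutoff, which shows that the rule requires all three conditions at once.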


In [8]:
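# Keep only the key and the label, then attach the label to every applicant.
high_risk_df = high_risk_df[['ID', 'risk']]
personality_scores_df = pd.merge(high_risk_df, personality_scores_df, on="ID", how="outer")

# Applicants that were not flagged have no label after the outer merge.
personality_scores_df["risk"] = personality_scores_df["risk"].fillna("low risk")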
personality_scores_df.head()

Unnamed: 0,ID,risk,Department,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],...,Section 5 of 6 [I sympathize with others' feelings.],Section 5 of 6 [I take time out for others.],Section 5 of 6 [I talk to a lot of different people at parties.],Section 5 of 6 [I use difficult words.],Section 5 of 6 [I worry about things.],Extraversion,Agreableness,Conscientiousness,Neuroticism,Imagination
0,881,high risk,data,"(3, 3)","(4, 1)","(3, 1)","(5, 5)","(2, 1)","(5, 3)","(2, 5)",...,"(2, 5)","(2, 3)","(1, 5)","(5, 1)","(4, 1)",30,28,26,28,36
1,1197,high risk,copywriting,"(3, 5)","(4, 5)","(3, 1)","(5, 1)","(2, 1)","(5, 3)","(2, 5)",...,"(2, 5)","(2, 1)","(1, 3)","(5, 5)","(4, 1)",40,22,26,26,28
2,0,low risk,data,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",30,40,48,36,42
3,1,low risk,data,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)",...,"(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)",42,46,46,40,42
4,2,low risk,data,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",28,40,40,38,42


This snapshot lets us see at a glance whether an individual has been flagged as high risk.
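
The same label could be assigned without a merge, which also keeps the rows in their original order. A minimal sketch, assuming the flagged IDs are already known; the four-row frame is a hand-picked subset of the IDs and departments shown in the outputs above.

```python
import numpy as np
import pandas as pd

# The two IDs reported as high risk earlier in the notebook.
flagged_ids = [881, 1197]

# Small illustrative subset of applicants.
toy_scores = pd.DataFrame({
    "ID": [0, 881, 1, 1197],
    "Department": ["data", "data", "data", "copywriting"],
})

# Label each row according to whether its ID is in the flagged list.
toy_scores["risk"] = np.where(toy_scores["ID"].isin(flagged_ids), "high risk", "low risk")
print(toy_scores)
```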

In [9]:
# Count applicants per risk label and department, then pivot departments into columns.
department_risk_df = personality_scores_df.groupby(['risk', 'Department']).ID.count()
department_risk_df.unstack()


Department,copywriting,data,design,strategy,web dev
risk,,,,,
high risk,1.0,1.0,,,
low risk,325.0,328.0,120.0,449.0,331.0



In summary, we found two high-risk individuals, one in the data department and one in copywriting.
The remaining 1553 individuals who took the test are classified as low risk under this rule and are therefore considered eligible for work.
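
The summary table above shows empty cells and decimal counts because departments with no high-risk applicants have no group to count. One alternative that reports those combinations as integer zeros is `pd.crosstab`, sketched here on a made-up frame (`toy_df` is an assumption):

```python
import pandas as pd

# Hypothetical labels for four applicants.
toy_df = pd.DataFrame({
    "risk": ["high risk", "low risk", "low risk", "low risk"],
    "Department": ["data", "data", "design", "design"],
})

# Rows are risk labels, columns are departments; absent combinations count as 0.
risk_by_department = pd.crosstab(toy_df["risk"], toy_df["Department"])
print(risk_by_department)
```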
