# Data Wrangling

**Data wrangling, sometimes referred to as data munging, 
is the process of transforming and mapping data from one "raw" data form into another 
format with the intent of making it more appropriate 
and valuable for a variety of downstream purposes such as analytics.**
*Wikipedia*

## Importing data

In [25]:
import pandas as pd
import numpy as np

In [26]:
personality_data = pd.read_csv('personality_scores.csv',sep=';')

In [27]:
personality_data.head()

Unnamed: 0,ID,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],Section 5 of 6 [I am quick to understand things.],...,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,IPIP_HIGH_RISK
0,0,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)","(5, 5)",...,,,,,,,,,,
1,1,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)","(5, 5)",...,,,,,,,,,,
2,2,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,,,,,,,,,,
3,3,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)","(5, 3)",...,,,,,,,,,,
4,4,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)","(5, 5)",...,,,,,,,,,,


In [28]:
# personality_data.info()

In [29]:
personality_data.isnull().sum()

ID                                                           0
Section 5 of 6 [I am always prepared.]                       0
Section 5 of 6 [I am easily disturbed.]                      0
Section 5 of 6 [I am exacting (demanding) in my work.]       0
Section 5 of 6 [I am full of ideas.]                         0
                                                          ... 
Unnamed: 65                                               1555
Unnamed: 66                                               1555
Unnamed: 67                                               1555
Unnamed: 68                                               1555
IPIP_HIGH_RISK                                            1555
Length: 70, dtype: int64
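
The null counts above suggest that some columns are completely empty. As a minimal sketch, assuming a small made-up frame (the name `toy` and its columns are illustrative, not taken from the survey file), such columns could be listed like this:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data; the column names are made up.
toy = pd.DataFrame({
    "ID": [0, 1, 2],
    "answer": ["(3, 5)", "(4, 5)", "(3, 3)"],
    "Unnamed: 60": [np.nan, np.nan, np.nan],
})

null_counts = toy.isnull().sum()  # number of missing values per column
# Columns whose null count equals the number of rows are entirely empty.
all_null = null_counts[null_counts == len(toy)].index.tolist()
print(all_null)
```

Comparing each count with `len(toy)` separates fully empty columns from columns that are only partly missing.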

## Finding duplicate rows based on selected columns

In [30]:
personality_data.ID.unique()

array([   0,    1,    2, ..., 1552, 1553, 1554])

In [31]:
personality_data[personality_data.duplicated(keep=False)]


Unnamed: 0,ID,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],Section 5 of 6 [I am quick to understand things.],...,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,IPIP_HIGH_RISK


In [32]:
personality_data[personality_data['ID'].duplicated()]

Unnamed: 0,ID,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],Section 5 of 6 [I am quick to understand things.],...,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,IPIP_HIGH_RISK


In [33]:
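# Summing the per-ID row counts gives the total number of rows;
# it equals the number of unique IDs only if no ID is repeated.
dups_id=personality_data.pivot_table(index=['ID'],aggfunc='size').sum()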

print(dups_id)

1555
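
The empty results above are what `duplicated` returns when nothing is repeated. A small sketch on made-up data (the frame `toy` is hypothetical) shows what the same calls return when an ID does occur twice:

```python
import pandas as pd

# Made-up frame in which ID 1 appears twice.
toy = pd.DataFrame({"ID": [0, 1, 1, 2], "score": [5, 3, 3, 4]})

# keep=False marks every member of a duplicated group, not just the repeats.
dup_rows = toy[toy.duplicated(subset=["ID"], keep=False)]

# With the default keep='first', only the rows after the first one are flagged.
n_extra = int(toy["ID"].duplicated().sum())

print(dup_rows)
print(n_extra)
```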


## Drop any duplicates that exist

In [34]:

data=personality_data.drop_duplicates(subset=['ID'])


### Comparing the length of the new data frame with the old one, which may have had duplicates

*Python's assert statement is a debugging aid that tests a condition.
If the condition is true, it does nothing and your program just continues to execute.
But if the assert condition evaluates to false,
it raises an AssertionError exception with an optional error message.*


*The assert statement should show that the number of unique IDs in the original
data equals the length of the data set where duplicates have been dropped.*
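
As an illustration of the optional error message mentioned above, here is a minimal sketch on a made-up frame (`toy` and `deduped` are hypothetical names):

```python
import pandas as pd

# Made-up frame in which ID 1 appears twice.
toy = pd.DataFrame({"ID": [0, 1, 1, 2], "score": [5, 3, 3, 4]})
deduped = toy.drop_duplicates(subset=["ID"])

n_unique = toy["ID"].nunique()
# The message after the comma is only shown if the condition is false.
assert len(deduped) == n_unique, (
    f"expected {n_unique} rows after de-duplication, got {len(deduped)}"
)
```

If the two numbers differ, the `AssertionError` carries the formatted message, which makes the failure easier to read than a bare assert.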

In [35]:
old_data=len(personality_data.ID.unique())

In [36]:
new_data=len(data)

In [37]:
assert old_data==new_data

In [38]:
print('The length of unique IDs is:',old_data)

The length of unique IDs is: 1555


In [39]:
print('The length of data set after dropping duplicates:',new_data)

The length of data set after dropping duplicates: 1555


*The number of unique IDs in the old data set equals the length of the new data set,
which is the old data set with duplicate IDs dropped.*

In [40]:
# a= [eval(a[i]) for i in range(len(a))]



### Dropping columns with null values

In [41]:
data=data.dropna(axis='columns')
data

Unnamed: 0,ID,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],Section 5 of 6 [I am quick to understand things.],...,Section 5 of 6 [I often forget to put things back in their proper place],Section 5 of 6 [I pay attention to details.],Section 5 of 6 [I seldom feel blue (down).],Section 5 of 6 [I spend time reflecting on things.],Section 5 of 6 [I start conversations.],Section 5 of 6 [I sympathize with others' feelings.],Section 5 of 6 [I take time out for others.],Section 5 of 6 [I talk to a lot of different people at parties.],Section 5 of 6 [I use difficult words.],Section 5 of 6 [I worry about things.]
0,0,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 3)","(5, 5)","(1, 3)","(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
1,1,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 1)","(4, 1)","(5, 5)","(1, 5)","(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)"
2,2,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 1)","(5, 3)","(1, 3)","(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
3,3,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)","(5, 3)",...,"(3, 1)","(3, 5)","(4, 1)","(5, 5)","(1, 5)","(2, 5)","(2, 5)","(1, 5)","(5, 1)","(4, 1)"
4,4,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 5)","(5, 5)","(1, 3)","(2, 3)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1550,1550,"(3, 5)","(4, 5)","(3, 1)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,"(3, 1)","(3, 5)","(4, 1)","(5, 3)","(1, 5)","(2, 5)","(2, 3)","(1, 1)","(5, 1)","(4, 5)"
1551,1551,"(3, 3)","(4, 5)","(3, 5)","(5, 3)","(2, 5)","(5, 3)","(2, 3)","(2, 5)","(5, 5)",...,"(3, 3)","(3, 3)","(4, 1)","(5, 3)","(1, 3)","(2, 5)","(2, 5)","(1, 5)","(5, 1)","(4, 3)"
1552,1552,"(3, 5)","(4, 3)","(3, 5)","(5, 5)","(2, 5)","(5, 5)","(2, 3)","(2, 3)","(5, 5)",...,"(3, 3)","(3, 5)","(4, 5)","(5, 5)","(1, 5)","(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)"
1553,1553,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 1)","(5, 5)","(1, 5)","(2, 3)","(2, 3)","(1, 1)","(5, 1)","(4, 3)"
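
Note that `dropna(axis='columns')` removes a column as soon as it contains a single missing value. The sketch below, on a made-up frame with hypothetical column names, contrasts this default with `how='all'`, which only drops columns that are empty throughout:

```python
import numpy as np
import pandas as pd

# Made-up frame: one partly filled column and one entirely empty column.
toy = pd.DataFrame({
    "ID": [0, 1, 2],
    "mostly_filled": ["(3, 5)", None, "(3, 3)"],
    "empty": [np.nan, np.nan, np.nan],
})

any_null = toy.dropna(axis="columns")             # default how="any"
all_null = toy.dropna(axis="columns", how="all")  # drop only fully empty columns

print(list(any_null.columns))
print(list(all_null.columns))
```

Which variant fits depends on whether partly answered questions should be kept for later analysis.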


## Converting all the columns into a list

### Creating a dictionary

In [42]:
Dict={1:'Extraversion', 2:'Agreeableness', 3:'Conscientiousness',
      4:'Emotional Stability',5:'Intellect'}

In [43]:
a=data['Section 5 of 6 [I am always prepared.]'].tolist()
# Each cell is a string such as "(3, 5)"; eval turns it into a tuple.
a=[eval(cell) for cell in a]
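
`eval` executes whatever expression the string contains, which is risky for data read from a file. A possible alternative, sketched here on made-up strings, is `ast.literal_eval` from the standard library, which only accepts Python literals:

```python
import ast

# Strings in the same "(trait, score)" shape as the CSV cells (made-up values).
cells = ["(3, 5)", "(4, 5)", "(3, 3)"]

# literal_eval parses literals (tuples, numbers, strings, ...) without running code.
pairs = [ast.literal_eval(cell) for cell in cells]
print(pairs)

# Anything that is not a plain literal is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```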


In [44]:
data=data.set_index('ID')

In [81]:
b=data.iloc[0].tolist()
b=[eval(cell) for cell in b]


### Creating a function for addition

In [82]:
in_list=data.iloc[0].tolist()
in_list=[eval(cell) for cell in in_list]
# Add up the second element of each tuple, grouped by the first element.
totals = {}
for uid, x in in_list:
    totals[uid] = totals.get(uid, 0) + x

print(totals)

{3: 48, 4: 36, 5: 42, 2: 40, 1: 30}


In [49]:
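# The cell numbers show this cell ran earlier (In [49]) than the one above (In [82]),
# so the table below reflects an older `totals` and does not match the printed dict.
pd.DataFrame.from_dict(totals, orient='index')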

Unnamed: 0,0
3,40
4,38
5,42
2,40
1,28
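
The keys of `totals` appear to correspond to the trait ids in `Dict`, with the first element of each tuple read as the trait id and the second as the score. Under that assumption, a minimal sketch of a reusable helper that also labels the result with trait names (`sum_by_trait`, `trait_names` and the sample pairs are hypothetical):

```python
from collections import defaultdict

import pandas as pd

# Same mapping as Dict above, repeated so the sketch is self-contained.
trait_names = {1: "Extraversion", 2: "Agreeableness", 3: "Conscientiousness",
               4: "Emotional Stability", 5: "Intellect"}

def sum_by_trait(pairs):
    """Add up the second element of each pair, grouped by the first."""
    totals = defaultdict(int)
    for trait_id, score in pairs:
        totals[trait_id] += score
    return dict(totals)

# Made-up answers for one respondent, each given as (trait id, score).
pairs = [(3, 5), (4, 5), (3, 3), (5, 5), (2, 3)]
totals = sum_by_trait(pairs)

# Replace the numeric ids by trait names and sort alphabetically.
named = pd.Series(totals).rename(index=trait_names).sort_index()
print(named)
```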


In [83]:
list_of_rows=data.values.tolist()
# Check that every cell parses; the parsed rows are not stored here.
for l in list_of_rows:
    l=[eval(cell) for cell in l]
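
Instead of printing one dictionary per respondent, the per-trait totals could be collected into a data frame that keeps the respondent ID as index. A sketch on a made-up two-row frame (`toy`, `row_totals` and the column names are hypothetical), again assuming each cell holds a `(trait id, score)` pair:

```python
import ast

import pandas as pd

# Made-up stand-in for `data`: string cells, indexed by respondent ID.
toy = pd.DataFrame(
    {"q1": ["(3, 5)", "(3, 3)"],
     "q2": ["(4, 5)", "(4, 1)"],
     "q3": ["(3, 1)", "(3, 5)"]},
    index=pd.Index([0, 1], name="ID"),
)

def row_totals(row):
    """Parse each '(trait id, score)' cell and sum the scores per trait id."""
    totals = {}
    for cell in row:
        trait_id, score = ast.literal_eval(cell)
        totals[trait_id] = totals.get(trait_id, 0) + score
    return totals

# One row of totals per respondent, one column per trait id.
scores = pd.DataFrame([row_totals(r) for r in toy.values.tolist()], index=toy.index)
print(scores)
```

A frame like `scores` is easier to inspect and to pass on to later analysis than a long list of printed dictionaries.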
       


In [85]:
def sum_by_trait(pairs):
    """Return the sum of the second elements, grouped by the first element."""
    totals = {}
    for uid, x in pairs:
        totals[uid] = totals.get(uid, 0) + x
    return totals

# Print one dictionary of totals per row of the data set.
for l in list_of_rows:
    l=[eval(cell) for cell in l]
    print(sum_by_trait(l))

{3: 48, 4: 36, 5: 42, 2: 40, 1: 30}
{3: 46, 4: 40, 5: 42, 2: 46, 1: 42}
{3: 40, 4: 38, 5: 42, 2: 40, 1: 28}
{3: 38, 4: 40, 5: 38, 2: 38, 1: 30}
{3: 46, 4: 38, 5: 36, 2: 34, 1: 28}
{3: 42, 4: 46, 5: 36, 2: 46, 1: 48}
{3: 50, 4: 36, 5: 42, 2: 44, 1: 38}
{3: 48, 4: 42, 5: 42, 2: 48, 1: 30}
{3: 48, 4: 44, 5: 48, 2: 46, 1: 40}
{3: 36, 4: 18, 5: 42, 2: 44, 1: 32}
{3: 46, 4: 38, 5: 40, 2: 44, 1: 30}
{3: 48, 4: 20, 5: 46, 2: 36, 1: 24}
{3: 36, 4: 20, 5: 38, 2: 46, 1: 36}
{3: 44, 4: 42, 5: 42, 2: 40, 1: 32}
{3: 42, 4: 18, 5: 36, 2: 40, 1: 24}
{3: 46, 4: 34, 5: 38, 2: 38, 1: 36}
{3: 48, 4: 34, 5: 42, 2: 46, 1: 28}
{3: 44, 4: 42, 5: 32, 2: 36, 1: 32}
{3: 36, 4: 14, 5: 44, 2: 40, 1: 24}
{3: 42, 4: 24, 5: 44, 2: 40, 1: 18}
{3: 50, 4: 42, 5: 36, 2: 50, 1: 30}
{3: 38, 4: 40, 5: 42, 2: 32, 1: 34}
{3: 50, 4: 30, 5: 48, 2: 46, 1: 30}
{3: 46, 4: 38, 5: 40, 2: 46, 1: 28}
{3: 38, 4: 42, 5: 36, 2: 46, 1: 28}
{3: 42, 4: 34, 5: 34, 2: 26, 1: 20}
{3: 40, 4: 42, 5: 48, 2: 46, 1: 40}
{3: 32, 4: 14, 5: 48, 2: 48,

{3: 44, 4: 38, 5: 36, 2: 34, 1: 28}
{3: 40, 4: 22, 5: 46, 2: 38, 1: 30}
{3: 50, 4: 32, 5: 46, 2: 50, 1: 36}
{3: 40, 4: 32, 5: 38, 2: 42, 1: 14}
{3: 42, 4: 40, 5: 46, 2: 36, 1: 40}
{3: 48, 4: 24, 5: 36, 2: 46, 1: 14}
{3: 44, 4: 40, 5: 38, 2: 48, 1: 34}
{3: 44, 4: 36, 5: 40, 2: 44, 1: 28}
{3: 50, 4: 42, 5: 42, 2: 44, 1: 44}
{3: 30, 4: 34, 5: 30, 2: 32, 1: 30}
{3: 44, 4: 38, 5: 46, 2: 44, 1: 34}
{3: 44, 4: 20, 5: 46, 2: 44, 1: 18}
{3: 50, 4: 34, 5: 46, 2: 42, 1: 30}
{3: 46, 4: 36, 5: 40, 2: 38, 1: 26}
{3: 30, 4: 46, 5: 46, 2: 46, 1: 42}
{3: 30, 4: 38, 5: 44, 2: 38, 1: 42}
{3: 38, 4: 24, 5: 34, 2: 36, 1: 34}
{3: 44, 4: 32, 5: 44, 2: 48, 1: 26}
{3: 48, 4: 40, 5: 46, 2: 46, 1: 50}
{3: 46, 4: 38, 5: 46, 2: 38, 1: 26}
{3: 40, 4: 24, 5: 42, 2: 46, 1: 40}
{3: 46, 4: 38, 5: 40, 2: 44, 1: 34}
{3: 46, 4: 44, 5: 42, 2: 44, 1: 32}
{3: 36, 4: 38, 5: 34, 2: 48, 1: 48}
{3: 50, 4: 20, 5: 38, 2: 44, 1: 18}
{3: 38, 4: 42, 5: 46, 2: 40, 1: 26}
{3: 44, 4: 28, 5: 34, 2: 48, 1: 32}
{3: 44, 4: 48, 5: 44, 2: 50,

{3: 50, 4: 40, 5: 44, 2: 44, 1: 36}
{3: 34, 4: 24, 5: 36, 2: 32, 1: 28}
{3: 38, 4: 34, 5: 38, 2: 42, 1: 28}
{3: 42, 4: 42, 5: 36, 2: 40, 1: 30}
{3: 46, 4: 36, 5: 30, 2: 50, 1: 26}
{3: 46, 4: 38, 5: 30, 2: 36, 1: 28}
{3: 44, 4: 42, 5: 48, 2: 44, 1: 30}
{3: 36, 4: 40, 5: 28, 2: 32, 1: 20}
{3: 48, 4: 36, 5: 36, 2: 46, 1: 34}
{3: 40, 4: 34, 5: 30, 2: 36, 1: 16}
{3: 34, 4: 38, 5: 30, 2: 44, 1: 18}
{3: 30, 4: 26, 5: 38, 2: 32, 1: 14}
{3: 44, 4: 34, 5: 36, 2: 50, 1: 42}
{3: 50, 4: 44, 5: 46, 2: 46, 1: 46}
{3: 42, 4: 34, 5: 38, 2: 46, 1: 34}
{3: 48, 4: 46, 5: 38, 2: 34, 1: 12}
{3: 34, 4: 26, 5: 36, 2: 50, 1: 24}
{3: 42, 4: 44, 5: 44, 2: 46, 1: 32}
{3: 44, 4: 34, 5: 28, 2: 46, 1: 36}
{3: 40, 4: 32, 5: 42, 2: 50, 1: 30}
{3: 24, 4: 32, 5: 40, 2: 42, 1: 14}
{3: 48, 4: 40, 5: 38, 2: 48, 1: 32}
{3: 40, 4: 20, 5: 50, 2: 36, 1: 18}
{3: 48, 4: 40, 5: 48, 2: 38, 1: 28}
{3: 42, 4: 44, 5: 46, 2: 42, 1: 46}
{3: 46, 4: 36, 5: 40, 2: 42, 1: 34}
{3: 48, 4: 44, 5: 50, 2: 46, 1: 32}
{3: 40, 4: 34, 5: 42, 2: 36,

{3: 42, 4: 36, 5: 40, 2: 36, 1: 20}
{3: 48, 4: 36, 5: 48, 2: 44, 1: 26}
{3: 48, 4: 32, 5: 42, 2: 36, 1: 32}
{3: 38, 4: 32, 5: 26, 2: 40, 1: 36}
{3: 48, 4: 32, 5: 44, 2: 40, 1: 32}
{3: 44, 4: 30, 5: 36, 2: 42, 1: 42}
{3: 46, 4: 22, 5: 44, 2: 48, 1: 30}
{3: 50, 4: 40, 5: 46, 2: 40, 1: 40}
{3: 38, 4: 26, 5: 30, 2: 36, 1: 24}
{3: 38, 4: 30, 5: 40, 2: 44, 1: 24}
{3: 40, 4: 42, 5: 38, 2: 44, 1: 40}
{3: 30, 4: 34, 5: 44, 2: 48, 1: 40}
{3: 44, 4: 28, 5: 32, 2: 46, 1: 22}
{3: 40, 4: 38, 5: 44, 2: 48, 1: 30}
{3: 40, 4: 34, 5: 42, 2: 38, 1: 34}
{3: 38, 4: 30, 5: 42, 2: 50, 1: 20}
{3: 42, 4: 36, 5: 38, 2: 42, 1: 32}
{3: 50, 4: 16, 5: 34, 2: 46, 1: 26}
{3: 40, 4: 40, 5: 38, 2: 46, 1: 32}
{3: 48, 4: 40, 5: 42, 2: 46, 1: 32}
{3: 42, 4: 36, 5: 42, 2: 46, 1: 32}
{3: 38, 4: 34, 5: 48, 2: 50, 1: 38}
{3: 44, 4: 36, 5: 44, 2: 44, 1: 18}
{3: 44, 4: 38, 5: 50, 2: 50, 1: 32}
{3: 48, 4: 48, 5: 48, 2: 50, 1: 32}
{3: 46, 4: 38, 5: 40, 2: 44, 1: 40}
{3: 48, 4: 40, 5: 44, 2: 44, 1: 48}
{3: 42, 4: 48, 5: 42, 2: 42,

{3: 48, 4: 40, 5: 40, 2: 40, 1: 38}
{3: 46, 4: 38, 5: 34, 2: 38, 1: 16}
{3: 46, 4: 44, 5: 34, 2: 44, 1: 26}
{3: 40, 4: 18, 5: 46, 2: 42, 1: 40}
{3: 42, 4: 42, 5: 36, 2: 40, 1: 38}
{3: 42, 4: 32, 5: 34, 2: 38, 1: 38}
{3: 42, 4: 32, 5: 36, 2: 34, 1: 24}
{3: 46, 4: 30, 5: 42, 2: 46, 1: 36}
{3: 44, 4: 28, 5: 42, 2: 50, 1: 34}
{3: 36, 4: 30, 5: 46, 2: 46, 1: 32}
{3: 48, 4: 40, 5: 46, 2: 48, 1: 32}
{3: 40, 4: 18, 5: 40, 2: 38, 1: 18}
{3: 36, 4: 28, 5: 50, 2: 50, 1: 44}
{3: 44, 4: 42, 5: 42, 2: 46, 1: 14}
{3: 48, 4: 42, 5: 42, 2: 34, 1: 28}
{3: 42, 4: 30, 5: 26, 2: 30, 1: 18}
{3: 30, 4: 24, 5: 32, 2: 36, 1: 30}
{3: 42, 4: 40, 5: 42, 2: 40, 1: 34}
{3: 44, 4: 26, 5: 38, 2: 46, 1: 30}
{3: 44, 4: 40, 5: 30, 2: 38, 1: 16}
{3: 44, 4: 34, 5: 38, 2: 48, 1: 48}
{3: 44, 4: 32, 5: 44, 2: 44, 1: 36}
{3: 48, 4: 38, 5: 40, 2: 32, 1: 34}
{3: 46, 4: 18, 5: 44, 2: 36, 1: 26}
{3: 46, 4: 40, 5: 32, 2: 42, 1: 30}
{3: 46, 4: 34, 5: 48, 2: 50, 1: 36}
{3: 48, 4: 44, 5: 40, 2: 48, 1: 28}
{3: 44, 4: 40, 5: 32, 2: 34,