# Handling Strings, Dummies and Missing Values

## String manipulation

This example uses the data in Strings.csv, which contains three columns: Age, Income_M and Expenses.

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame
import os
os.chdir("C:\\Python Code\\Data Manipulation with Pandas\\Handling Strings")


In [2]:
# Read the data; if the files were copied elsewhere, change the folder in the os.chdir call above
st=pd.read_csv("Strings.csv")


In [3]:
print (st.head())


   Age    Income_M Expenses
0   10  Rs 12000/-    8,000
1   30  Rs 45000/-   21,000
2   34  Rs 39000/-   20,000
3   16   Rs 6000/-    2,000
4   19  Rs 20000/-   10,000


In [4]:
# Although the values above look numeric, calling a method such as mean()
# raises an error
st['Income_M'].mean()


TypeError: Could not convert Rs 12000/-Rs 45000/-Rs 39000/-Rs 6000/-Rs 20000/-Rs 42000/-Rs 34000/-Rs 56000/-Rs 25000/-Rs 100000/-Rs 56000/-Rs 2000/-Rs 40000/-Rs 27000/-Rs 32000/-Rs 34000/-Rs 20000/-Rs 23000/-Rs 57000/-Rs 62000/- to numeric

In [5]:
# The Income_M column contains the text 'Rs' and the suffix '/-'.
# Remove them with the str.replace method, one pattern at a time.
st['Income_M']=st['Income_M'].str.replace("Rs","")
print (st.head())


   Age  Income_M Expenses
0   10   12000/-    8,000
1   30   45000/-   21,000
2   34   39000/-   20,000
3   16    6000/-    2,000
4   19   20000/-   10,000


In [6]:
st['Income_M']=st['Income_M'].str.replace("/-","")
print (st.head())


   Age Income_M Expenses
0   10    12000    8,000
1   30    45000   21,000
2   34    39000   20,000
3   16     6000    2,000
4   19    20000   10,000
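
The two `str.replace` calls can also be combined into a single regular-expression pass. The following is a minimal sketch, assuming the income values contain no decimal points or minus signs (the pattern `\D` would strip those too). The sample values are copied from the `head()` output; the names `income` and `income_clean` are illustrative only.

```python
import pandas as pd

# Sample values copied from the head() output of Strings.csv
income = pd.Series(["Rs 12000/-", "Rs 45000/-", "Rs 39000/-", "Rs 6000/-", "Rs 20000/-"])

# Remove every non-digit character in one pass, then convert to a numeric dtype
income_clean = pd.to_numeric(income.str.replace(r"\D", "", regex=True))

print(income_clean.tolist())
print(income_clean.dtype)
```

Passing `regex=True` explicitly avoids relying on the default, which has changed between pandas versions.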


In [7]:
# The column's dtype is still object
st['Income_M'].dtype


dtype('O')

In [8]:
# We can convert it into numeric using the to_numeric function.
st.Income_M=pd.to_numeric(st.Income_M)


In [9]:
st['Income_M'].dtype


dtype('int64')

In [10]:
# Methods such as mean() now work without error.
st.Income_M.mean()


36600.0
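
The Expenses column has a similar problem: the thousands separators keep it as text. A short sketch of the same approach, assuming the comma is only ever used as a thousands separator; the sample values are copied from the `head()` output and the variable names are illustrative.

```python
import pandas as pd

# Sample values copied from the Expenses column shown in the head() output
expenses = pd.Series(["8,000", "21,000", "20,000", "2,000", "10,000"])

# Remove the literal comma (regex=False treats the pattern as plain text),
# then convert the cleaned strings to numbers
expenses_num = pd.to_numeric(expenses.str.replace(",", "", regex=False))

print(expenses_num.mean())
```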

## Handling categorical variables

Categorical variables are those whose values always come from a fixed set, e.g. days of the week or nationality.
However, most machine learning models work only on numeric data. Hence, we have to convert categorical variables into indicator variables.  

A common way to use categorical data in an analysis is to convert it into indicator (0/1) variables.
Every value of the category becomes a new column, and each row takes the value 1 or 0 in that column based on its original value. 
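
To make the idea concrete, here is a small hand-built sketch of indicator columns. The `day` series and the `Day_` prefix are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical categorical column with three distinct values
day = pd.Series(["Mon", "Tue", "Mon", "Wed"], name="Day")

# Build one 0/1 column per distinct value: 1 where the row matches, 0 otherwise
indicators = pd.DataFrame({
    f"Day_{value}": (day == value).astype(int)
    for value in sorted(day.unique())
})

print(indicators)
```

Each row has exactly one 1 across the indicator columns, because every row belongs to exactly one category.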

To explain this further, we are going to use the Olympics medal data.

In [11]:
# Read the medal information into a dataframe. 
dummy=pd.read_csv("medal.csv",sep=',',header=0, encoding="latin")


In [12]:
# Display the first five rows using the head command 
dummy.head()


Unnamed: 0,City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
0,Athens,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1,Athens,1896,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
2,Athens,1896,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
3,Athens,1896,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
4,Athens,1896,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100m freestyle for sailors,M,Silver


In [13]:
# A dummy variable is essentially a presence/absence indicator variable.
# For example, the Medal column holds three values: Gold, Silver and
# Bronze. The pandas get_dummies method converts the values in such a
# column into separate dummy columns. For Medal it creates three
# columns: Medal_Gold, Medal_Silver and Medal_Bronze. The Medal_Gold
# column is flagged as 1 wherever the value is Gold and 0 otherwise;
# Medal_Silver and Medal_Bronze are flagged in the same way.
# Called on the whole dataframe, get_dummies encodes every text column.
dummies = pd.get_dummies(dummy)


In [14]:
dummies.head()


Unnamed: 0,Edition,City_Amsterdam,City_Antwerp,City_Athens,City_Atlanta,City_Barcelona,City_Beijing,City_Berlin,City_Helsinki,City_London,...,Event_épée individual,Event_épée team,"Event_épée, amateurs and masters","Event_épée, masters",Event_gender_M,Event_gender_W,Event_gender_X,Medal_Bronze,Medal_Gold,Medal_Silver
0,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
2,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
3,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
4,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1


In [15]:
dummies.iloc[:100,:]


Unnamed: 0,Edition,City_Amsterdam,City_Antwerp,City_Athens,City_Atlanta,City_Barcelona,City_Beijing,City_Berlin,City_Helsinki,City_London,...,Event_épée individual,Event_épée team,"Event_épée, amateurs and masters","Event_épée, masters",Event_gender_M,Event_gender_W,Event_gender_X,Medal_Bronze,Medal_Gold,Medal_Silver
0,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
2,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
3,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
4,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
96,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
97,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
98,1896,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
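
Calling `get_dummies` on the whole dataframe creates a column for every distinct city, athlete, event and so on. If only the Medal column needs encoding, the `columns` argument restricts the conversion. The sketch below uses a few rows and columns copied from the medal data; `sample` and `encoded` are illustrative names. Recent pandas versions return boolean dummies by default, so `dtype=int` is passed here on the assumption that 0/1 output is wanted.

```python
import pandas as pd

# First five rows of the medal data, reduced to three columns
sample = pd.DataFrame({
    "Athlete": ["HAJOS, Alfred", "HERSCHMANN, Otto", "DRIVAS, Dimitrios",
                "MALOKINIS, Ioannis", "CHASAPIS, Spiridon"],
    "NOC": ["HUN", "AUT", "GRE", "GRE", "GRE"],
    "Medal": ["Gold", "Silver", "Bronze", "Gold", "Silver"],
})

# Encode only Medal; the other text columns are left unchanged
encoded = pd.get_dummies(sample, columns=["Medal"], dtype=int)

print(encoded)
```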


## Handling missing values

The data that we receive is often not clean. As we have seen, pandas treats only NaN (along with None and NaT) as missing, so placeholders such as the text 'Missing' must be converted to NaN first. Handling missing values is part of the data pre-processing stage: it is a sanity check to ensure that the data we have is robust enough.
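
The effect of a placeholder string can be seen on a tiny in-memory file. This is a sketch only: the CSV text in `raw` is hypothetical and merely imitates two columns of the credit data.

```python
import io
import pandas as pd

# Hypothetical extract: one age is recorded as the text 'Missing'
raw = "age,Gender\n45,Male\nMissing,Female\n52,\n"

# Default parsing keeps 'Missing' as an ordinary string, so nothing is flagged
as_is = pd.read_csv(io.StringIO(raw))

# Declaring the placeholder in na_values turns it into NaN
declared = pd.read_csv(io.StringIO(raw), na_values=["Missing"])

print(as_is["age"].isnull().sum())
print(declared["age"].isnull().sum())
```

Without `na_values`, the age column is read as text and no value counts as missing; with it, the column becomes numeric with one NaN.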

To demonstrate the handling of missing values, we will use the credit data.

In [16]:
# Read the data from Credit.csv. Treat the value
# 'Missing' and empty strings as NaN
dat_m=pd.read_csv('Credit.csv',na_values=['Missing',""])




In [17]:
# Use the isnull() method together with sum() to
# count the missing values in each column of the dataframe.
dat_m.isnull().sum()


NPA Status                                  2
RevolvingUtilizationOfUnsecuredLines        2
age                                         2
Gender                                      2
Region                                      2
MonthlyIncome                           29733
Rented_OwnHouse                             2
Occupation                                  2
Education                                   2
NumberOfTime30-59DaysPastDueNotWorse        2
DebtRatio                                   2
MonthlyIncome.1                         29733
NumberOfOpenCreditLinesAndLoans             2
NumberOfTimes90DaysLate                     2
NumberRealEstateLoansOrLines                2
NumberOfTime60-89DaysPastDueNotWorse        2
NumberOfDependents                       3924
Good_Bad                                    2
dtype: int64
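
Raw counts are easier to judge as a share of the rows. A minimal sketch, using a small made-up frame in place of the credit data: since `isnull()` returns booleans, taking `mean()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the credit data
sample = pd.DataFrame({
    "age": [45.0, np.nan, 52.0, 38.0],
    "MonthlyIncome": [np.nan, 5400.0, np.nan, 3000.0],
})

# Fraction of missing values in each column
missing_share = sample.isnull().mean()

print(missing_share)
```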

In [18]:
# The column 'age' has only 2 missing values; replace
# them with the mean of all the values in the column.
# The fillna method is used to fill the missing values.
dat_m['age']=dat_m['age'].fillna(np.mean(dat_m['age']))


If the data set is large and the number of missing values is very small, we can delete those rows. 
However, if the data set is small, every row becomes valuable and we need alternative methods to 
populate the missing values.

We could replace the missing values with mean, median or mode values. We can even hardcode values if necessary.
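
These options can be sketched on a small made-up frame. The column names follow the credit data, but the values are invented for illustration, and which strategy is appropriate depends on the data at hand.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the credit data
sample = pd.DataFrame({
    "age": [45.0, np.nan, 52.0, 38.0],
    "NumberOfDependents": [2.0, 0.0, np.nan, 0.0],
    "Gender": ["Male", None, "Female", "Female"],
})

# Option 1: delete the rows where 'age' is missing
dropped = sample.dropna(subset=["age"])

# Option 2: fill a numeric column with its median and a text column with its mode
filled = sample.copy()
filled["NumberOfDependents"] = filled["NumberOfDependents"].fillna(
    filled["NumberOfDependents"].median()
)
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode()[0])

print(len(dropped))
print(filled)
```

A hard-coded replacement works the same way, for example `fillna(0)` for a numeric column.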