# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint
### Data Munging with Pandas

#### Task overview
We have a file that contains gender, height, and weight information. A typical line of the file is:

"Male",66.3162319187446,170.593858104457

We want to store all such lines in convenient data structures as three separate items and be able to manipulate them.

### Run the below cell to download the dataset

In [None]:
!wget https://cdn.talentsprint.com/aiml/Experiment_related_data/Week0-part_gender_height_weight.csv
!wget https://cdn.talentsprint.com/aiml/Experiment_related_data/Week0-full_gender_height_weight.csv

In [None]:
# We take a look at the contents of the file 
# by using the shell command head
!head Week0-part_gender_height_weight.csv

We start by simply reading the file and storing it. But we want to skip the first line as it is a header and does not have data. We also want to store the data instead of merely printing it. But we will print the first ten items to verify that all is well.

In [None]:
import pandas as pd

In [None]:
PART_DATA = "Week0-part_gender_height_weight.csv"
FULL_DATA = "Week0-full_gender_height_weight.csv"

In [None]:
firstLine = True
data = []
for line in open(FULL_DATA):
    if firstLine:
        firstLine = False
    else:
        data.append(line)
print(data[:10])
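The flag-based loop above can also be written by consuming the header with `next()` before collecting the remaining lines. This is a minimal sketch of that idea, not a replacement for the cell above. It uses an in-memory `io.StringIO` as a stand-in for the real file so that it runs without the download; the header text and the second data row are assumptions made up for illustration.

```python
import io

# Stand-in for open(FULL_DATA). The header text and the second data row
# are assumed values, used only so the sketch runs on its own.
sample = io.StringIO(
    '"Gender","Height","Weight"\n'
    '"Male",66.3162319187446,170.593858104457\n'
    '"Female",62.5,120.25\n'
)

with sample as f:
    next(f)          # consume the header line
    rows = list(f)   # every remaining line is data

print(rows[:10])
```

With a real file, `with open(FULL_DATA) as f:` plays the same role and also closes the file when the block ends.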

As you can see, there is some extraneous stuff:
  1. A \n at the end of each line
  2. The double quotes around the gender
  3. Each line is a single string, with the fields separated by commas

We handle these issues in the next version.

In [None]:
firstLine = True
COMMA = ','
QUOTE = '"'
data = []
for line in open(PART_DATA):
    if firstLine:
        firstLine = False
    else:
        g, h, w = line.strip().split(COMMA)
        data.append([g.strip(QUOTE), float(h), float(w)])
print(data[:10])
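The same three cleanup steps (dropping the newline, removing the quotes, splitting on commas) can also be delegated to Python's standard `csv` module. The sketch below is an alternative to the manual `strip`/`split` approach, shown on an in-memory stand-in for the file; the header text and the second data row are made-up illustrative values.

```python
import csv
import io

# Stand-in for the CSV file; header text and the second row are assumptions.
sample = io.StringIO(
    '"Gender","Height","Weight"\n'
    '"Male",66.3162319187446,170.593858104457\n'
    '"Female",62.5,120.25\n'
)

reader = csv.reader(sample)   # handles quotes, commas and line endings
next(reader)                  # skip the header row
parsed = [[g, float(h), float(w)] for g, h, w in reader]

print(parsed)
```

`csv.reader` returns every field as a string, so the numeric columns still need the explicit `float()` conversion.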

Now we need to convert inches to cm and pounds to kg, round both to the nearest integer, and we are done. Here is the final code that does this.

In [None]:
firstLine = True
COMMA = ','
QUOTE = '"'
INCH2CM = 2.54
POUND2KG = 0.4536
data = []
for line in open(PART_DATA):
    if firstLine:
        firstLine = False
    else:
        g, h, w = line.strip().split(COMMA)
        g = g.strip(QUOTE)
        h_cm = int(float(h) * INCH2CM + 0.5)
        w_kg = int(float(w) * POUND2KG + 0.5)
        data.append([g, h_cm, w_kg])
print(data[:4])
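The `int(x + 0.5)` idiom used above rounds a non-negative number to the nearest integer, with exact halves going up. The sketch below works through the typical line from the task overview and contrasts the idiom with the built-in `round()`, which sends exact halves to the nearest even integer. The helper name `round_half_up` is introduced here only for illustration.

```python
INCH2CM = 2.54
POUND2KG = 0.4536

def round_half_up(x):
    """Round a non-negative number to the nearest integer, halves going up."""
    return int(x + 0.5)

# Values from the typical line: "Male",66.3162319187446,170.593858104457
h_cm = round_half_up(66.3162319187446 * INCH2CM)    # about 168.44 cm
w_kg = round_half_up(170.593858104457 * POUND2KG)   # about 77.38 kg
print(h_cm, w_kg)   # 168 77

# The two approaches differ only on exact halves.
print(round_half_up(2.5), round(2.5))   # 3 2
```

Because `int()` truncates toward zero, the idiom is only reliable for non-negative values, which is the case for heights and weights.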

One reason Python is popular for scientific computing is the availability of libraries that do a lot of standard grunt work in a few lines. We will see how the pandas library can make short work of all the above.

In [None]:
import pandas as pd
pd.read_csv(PART_DATA)

As you can see, pandas gives you a nice display! It figured out the column titles and numbered the rows as well. It actually loads the data into a DataFrame, which we can treat like a dictionary whose keys are the column names and whose values are the data in each column. Note that the data type of each column has been inferred too.

In [None]:
data = pd.read_csv(FULL_DATA)
type(data['Gender'][0]), type(data['Height'][1]), type(data['Weight'][30])

In [None]:
data["Gender"][21]

In [None]:
data['Weight']
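To make the dictionary-style access and the type inference concrete, here is a small sketch on an in-memory stand-in for the file. The header text and the second data row are assumptions chosen for illustration; the column names match the ones used in the cells above.

```python
import io
import pandas as pd

# Stand-in for the CSV file; header text and the second row are assumptions.
sample = io.StringIO(
    '"Gender","Height","Weight"\n'
    '"Male",66.3162319187446,170.593858104457\n'
    '"Female",62.5,120.25\n'
)
df = pd.read_csv(sample)

print(list(df.columns))   # the "keys" of the DataFrame
print(df.dtypes)          # one inferred type per column
print(df["Height"][0])    # column first, then row label
```

`df.dtypes` reports the inferred type of every column at once, which is often quicker than checking individual elements with `type()`.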

In [None]:
df = pd.read_csv(PART_DATA, header=0, names=["GEN", "HT", "WT"])
df.GEN
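In the cell above, `header=0` tells pandas that the first line of the file is a header, and `names` replaces that header with our own column names. The sketch below illustrates why both arguments are given together: when `names` is passed without `header=0`, the original header line is read as an ordinary data row. The sample text is an in-memory stand-in, with an assumed header and a made-up second row.

```python
import io
import pandas as pd

# Stand-in for the CSV file; header text and the second row are assumptions.
text = (
    '"Gender","Height","Weight"\n'
    '"Male",66.3162319187446,170.593858104457\n'
    '"Female",62.5,120.25\n'
)

# header=0: the first line is a header, and names overrides it.
renamed = pd.read_csv(io.StringIO(text), header=0, names=["GEN", "HT", "WT"])

# names alone: the first line is treated as data.
no_header = pd.read_csv(io.StringIO(text), names=["GEN", "HT", "WT"])

print(renamed)
print(no_header)
```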

The read_csv function offers even more flexibility: we can attach a converter function to selected columns, and pandas applies it to every value in that column as the file is read.

In [None]:
def inches2cms(s):
    return int(float(s) * 2.54 + 0.5)
def pounds2kgs(s):
    return int(float(s) * 0.4536 + 0.5)

In [None]:
pd.read_csv(FULL_DATA, converters={'Height':inches2cms, 'Weight':pounds2kgs})
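To see what the converters do without downloading the file, the sketch below applies the same two functions to an in-memory stand-in. The header text and the second data row are made-up illustrative values; the first row is the typical line from the task overview.

```python
import io
import pandas as pd

def inches2cms(s):
    # s is the raw cell value from the file
    return int(float(s) * 2.54 + 0.5)

def pounds2kgs(s):
    return int(float(s) * 0.4536 + 0.5)

# Stand-in for the CSV file; header text and the second row are assumptions.
sample = io.StringIO(
    '"Gender","Height","Weight"\n'
    '"Male",66.3162319187446,170.593858104457\n'
    '"Female",62.5,120.25\n'
)

converted = pd.read_csv(
    sample, converters={"Height": inches2cms, "Weight": pounds2kgs}
)
print(converted)
```

Columns without a converter, such as `Gender`, are read as usual.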

So our final code will be:

In [None]:
import pandas as pd

def inches2cms(s):
    return int(float(s) * 2.54 + 0.5)
def pounds2kgs(s):
    return int(float(s) * 0.4536 + 0.5)
data = pd.read_csv(FULL_DATA, converters={'Height':inches2cms, 'Weight':pounds2kgs})
data[:10]
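Converters are called once per cell. An alternative, shown here only as a sketch and not as the notebook's method, is to read the raw numbers first and then convert whole columns at once with arithmetic on the columns. The sample is an in-memory stand-in with an assumed header and a made-up second row.

```python
import io
import pandas as pd

# Stand-in for the CSV file; header text and the second row are assumptions.
sample = io.StringIO(
    '"Gender","Height","Weight"\n'
    '"Male",66.3162319187446,170.593858104457\n'
    '"Female",62.5,120.25\n'
)
metric = pd.read_csv(sample)

# Column-wise version of int(x * factor + 0.5); astype(int) truncates
# toward zero just like int(), so the rounding rule is unchanged.
metric["Height"] = (metric["Height"] * 2.54 + 0.5).astype(int)
metric["Weight"] = (metric["Weight"] * 0.4536 + 0.5).astype(int)

print(metric)
```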

In [None]:
pd.read_csv?

Now it is very easy to plot the data.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

def inches2cms(s):
    return int(float(s) * 2.54 + 0.5)
def pounds2kgs(s):
    return int(float(s) * 0.4536 + 0.5)
data = pd.read_csv(FULL_DATA, converters={'Height':inches2cms, 'Weight':pounds2kgs})
plt.plot(data['Height'], data['Weight'], "r.")
plt.xlabel("Height cms")
plt.ylabel("Weight kgs")
plt.show()

In [None]:
male = data[data.Gender == "Male"]
female = data[data.Gender == "Female"]

In [None]:
plt.plot(male['Height'], male['Weight'], 'r.')
plt.plot(female['Height'], female['Weight'], 'g.')
plt.show()

In [None]:
plt.plot(female['Height'], female['Weight'], 'g.')
plt.plot(male['Height'], male['Weight'], 'r.')
plt.show()
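The two cells above differ only in the order of the `plot` calls: whichever group is drawn second sits on top and can hide points from the first. One way to reduce this effect, sketched below with made-up values in cm and kg, is to draw the points semi-transparently and add a legend. The DataFrame `sample_df` is hypothetical and stands in for the real `data`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values in cm and kg, for illustration only.
sample_df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Height": [168, 175, 159, 162],
    "Weight": [77, 82, 55, 60],
})

fig, ax = plt.subplots()
for gender, style in [("Male", "r."), ("Female", "g.")]:
    subset = sample_df[sample_df.Gender == gender]
    # alpha < 1 lets overlapping points from both groups show through
    ax.plot(subset["Height"], subset["Weight"], style, alpha=0.5, label=gender)
ax.set_xlabel("Height cms")
ax.set_ylabel("Weight kgs")
ax.legend()
```

Calling `plt.show()` afterwards displays the figure, as in the cells above.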

In [None]:
d10 = data[:10]

In [None]:
d10

In [None]:
d10.iloc?

In [None]:
# Iterating over a DataFrame yields its column names, not its rows
for column in d10:
    print(column)

In [None]:
# Iterating over .values yields one array per row
for line in d10.values:
    print(line)
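The two loops above behave differently: iterating over the DataFrame itself walks through the column names, while iterating over `.values` walks through the rows. The sketch below shows both on a small made-up DataFrame (the name `small` and its values are hypothetical), together with `itertuples()`, which also yields rows but keeps the column names available as attributes.

```python
import pandas as pd

# Made-up values in cm and kg, for illustration only.
small = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Height": [168, 159],
    "Weight": [77, 55],
})

print(list(small))              # column names
print(small.values[0])          # first row as an array

for row in small.itertuples(index=False):
    print(row.Gender, row.Height, row.Weight)
```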

In [None]:
female = data[data.Gender == 'Female']
female

In [None]:
d10[d10.Height == 170]

In [None]:
heavy = data[data.Weight > 110]
tall = data[data.Height > 140]
tall
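The filters above work because a comparison such as `data.Weight > 110` produces a column of `True`/`False` values, and indexing the DataFrame with that column keeps only the rows marked `True`. The sketch below shows the intermediate mask and how two conditions can be combined. The DataFrame `people` and its thresholds are made up for illustration.

```python
import pandas as pd

# Made-up values in cm and kg, for illustration only.
people = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Height": [168, 175, 159, 162],
    "Weight": [77, 82, 55, 60],
})

mask = people.Height > 170          # one True/False value per row
print(mask.tolist())                # [False, True, False, False]

# Combine conditions with & (and) or | (or); the parentheses are needed
# because & and | bind more tightly than the comparisons.
tall_and_heavy = people[(people.Height > 170) & (people.Weight > 80)]
female_or_tall = people[(people.Gender == "Female") | (people.Height > 170)]

print(tall_and_heavy)
print(female_or_tall)
```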

In [None]:
data.drop?