## **Introduction to Pandas for Data Analysis**

    • What Is Pandas
    • Pandas Operations
          - Slicing the data frame
          - Merging & Joining
          - Concatenation
          - Changing the index
          - Changing column headers
          - Data munging
    • Use-Case

### **What Is Pandas:**

Pandas is a Python library for data manipulation, analysis, and cleaning. It is well suited to many kinds of data, such as:

    - Tabular data with heterogeneously-typed columns
    - Ordered and unordered time series data
    - Arbitrary matrix data with row & column labels
    - Unlabelled data
    - Any other form of observational or statistical data sets

Installing Pandas:
To install pandas, open your command line or terminal and type **pip install pandas**. If you have Anaconda installed on your system, type **conda install pandas** instead.

**- Terminal**

pip install pandas

**- Anaconda Prompt**

conda install pandas

Once the installation is complete, open your IDE (Jupyter, PyCharm, etc.) and import the library by typing: **import pandas as pd**

In [31]:
import pandas as pd

If you are using Jupyter or Google Colab, run “!pip install pandas” in a notebook cell.

!pip install pandas

### **Python Pandas Operations**

With pandas, you can work with Series and DataFrames, handle missing data, group and aggregate values, and more.

Some of the common operations for data manipulation are listed below:

1). Slicing

2). Merging & Joining

3). Concatenation

4). Changing the index 

5). Data Munging
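
Before going through these operations one by one, here is a minimal sketch of the other features mentioned above: a Series, missing data, and group by. The `traffic` frame, its `Channel` column, and all values are made up for illustration and are not part of the examples in this tutorial.

```python
import pandas as pd

# A Series is a single labeled column; None becomes NaN (missing data).
visitors = pd.Series([1000, 700, None, 1000], name="Visitors")
print(visitors.isna().sum())  # number of missing entries

# Replace missing values with 0 (one of several possible strategies).
filled = visitors.fillna(0)
print(filled)

# Hypothetical traffic data, used only to illustrate group by.
traffic = pd.DataFrame({
    "Channel": ["ads", "organic", "ads", "organic"],
    "Visitors": [100, 200, 300, 400],
})

# Sum the visitors for each channel.
per_channel = traffic.groupby("Channel")["Visitors"].sum()
print(per_channel)
```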

**Slicing the Data Frame**

To slice data, you first need a data frame. A DataFrame is a two-dimensional, labeled data structure and the most commonly used pandas object.

In [32]:
import pandas as pd
XYZ_web = {'Day':[1,2,3,4,5,6], "Visitors":[1000, 700,6000,1000,400,350], "Bounce_Rate":[20,20, 23,15,10,34]}
df= pd.DataFrame(XYZ_web)
print(df)

   Day  Visitors  Bounce_Rate
0    1      1000           20
1    2       700           20
2    3      6000           23
3    4      1000           15
4    5       400           10
5    6       350           34


As the output shows, the code above converts a dictionary into a pandas DataFrame, with the index displayed on the left.

In [33]:
df.head(2)

   Day  Visitors  Bounce_Rate
0    1      1000           20
1    2       700           20


This returns the first two rows. If you want the last two rows, use the command below:

In [34]:
df.tail(2)

   Day  Visitors  Bounce_Rate
4    5       400           10
5    6       350           34
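
`head` and `tail` only cover the two ends of a data frame. As a minimal sketch of more general slicing, the example below uses `iloc` (position-based), `loc` (label-based), and a boolean mask on the same data. The variable names `middle_rows`, `subset`, and `busy_days` are illustrative choices.

```python
import pandas as pd

# Same data as the XYZ_web example above.
df = pd.DataFrame({
    "Day": [1, 2, 3, 4, 5, 6],
    "Visitors": [1000, 700, 6000, 1000, 400, 350],
    "Bounce_Rate": [20, 20, 23, 15, 10, 34],
})

# Position-based slice: rows at positions 1, 2 and 3 (the end is exclusive).
middle_rows = df.iloc[1:4]

# Label-based slice: with .loc both ends are inclusive,
# and columns can be selected by name.
subset = df.loc[2:4, ["Day", "Visitors"]]

# Boolean mask: keep only the days with at least 1000 visitors.
busy_days = df[df["Visitors"] >= 1000]

print(middle_rows)
print(subset)
print(busy_days)
```

Note the difference between the two slice styles: `iloc[1:4]` stops before position 4, while `loc[2:4]` includes the row labeled 4.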


**Merging & Joining**

Merging combines two data frames into a single data frame. You can also choose which columns to use as the common key.

Let's put that into practice: first create two data frames that hold some key-value pairs, then merge them together.

In [35]:
import pandas as pd
df1 = pd.DataFrame({ "HPI":[80,90,70,60], "Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])
df1

      HPI  Int_Rate  IND_GDP
2001   80         2       50
2002   90         1       45
2003   70         2       45
2004   60         3       67


In [36]:
import pandas as pd
df2= pd.DataFrame({ "HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2005, 2006,2007,2008])
df2

      HPI  Int_Rate  IND_GDP
2005   80         2       50
2006   90         1       45
2007   70         2       45
2008   60         3       67


In [37]:
merged= pd.merge(df1,df2)

merged

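   HPI  Int_Rate  IND_GDP
0   80         2       50
1   90         1       45
2   70         2       45
3   60         3       67


As you can see above, the two data frames have been merged into a single data frame. Because no key was specified, pandas merged on all the columns the two frames share, and the original year indexes were replaced by a default index from 0 to 3. You can also specify the column you want to use as the common key.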

**Task 1: Use the “HPI” column as the common key and keep all the other columns separate.**

**Solution:**

In [40]:
df1 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])

In [41]:
df1

      HPI  Int_Rate  IND_GDP
2001   80         2       50
2002   90         1       45
2003   70         2       45
2004   60         3       67


In [43]:
df2 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2005, 2006,2007,2008])

In [44]:
df2

      HPI  Int_Rate  IND_GDP
2005   80         2       50
2006   90         1       45
2007   70         2       45
2008   60         3       67


In [45]:
merged= pd.merge(df1,df2, on ="HPI")
merged

   HPI  Int_Rate_x  IND_GDP_x  Int_Rate_y  IND_GDP_y
0   80           2         50           2         50
1   90           1         45           1         45
2   70           2         45           2         45
3   60           3         67           3         67
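
The `_x` and `_y` suffixes in the output are the defaults pandas adds when both frames contain a non-key column with the same name. As a minimal sketch, the example below sets the suffixes explicitly and compares the default inner merge with an outer merge. The frames `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical frames that share the "HPI" key but only partly overlap.
left = pd.DataFrame({"HPI": [80, 90, 70], "Int_Rate": [2, 1, 2]})
right = pd.DataFrame({"HPI": [80, 90, 60], "Int_Rate": [3, 4, 5]})

# Inner merge (the default): keep only HPI values present in both frames.
inner = pd.merge(left, right, on="HPI", suffixes=("_left", "_right"))

# Outer merge: keep every HPI value from either frame;
# cells with no match are filled with NaN.
outer = pd.merge(left, right, on="HPI", how="outer",
                 suffixes=("_left", "_right"))

print(inner)
print(outer)
```

The `how` parameter also accepts "left" and "right" when only one side should be kept in full.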


**Join**

Join is a convenient method for combining two differently indexed dataframes into a single result dataframe. It is similar to the “merge” operation, except that by default the rows are matched on the “index” instead of the “columns”.

In [47]:
df1 = pd.DataFrame({"Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])

In [48]:
df2 = pd.DataFrame({"Low_Tier_HPI":[50,45,67,34],"Unemployment":[1,3,5,6]}, index=[2001, 2003,2004,2004])

In [50]:
joined= df1.join(df2)
joined

      Int_Rate  IND_GDP  Low_Tier_HPI  Unemployment
2001         2       50          50.0           1.0
2002         1       45           NaN           NaN
2003         2       45          45.0           3.0
2004         3       67          67.0           5.0
2004         3       67          34.0           6.0


As the output shows, the index 2002 has no matching row in df2, so the “Low_Tier_HPI” and “Unemployment” columns contain NaN (Not a Number) for that year. Because NaN is a floating-point value, both columns are displayed as floats.

For 2004, df2 contains two rows, so the 2004 row of df1 appears twice, once for each match.
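
To make the relationship between join and merge concrete, here is a minimal sketch that repeats the join above, drops the unmatched year with `how="inner"`, and then reproduces the default join with `pd.merge` on the indexes. The variable names `left_join`, `inner_join`, and `via_merge` are illustrative choices.

```python
import pandas as pd

df1 = pd.DataFrame({"Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({"Low_Tier_HPI": [50, 45, 67, 34],
                    "Unemployment": [1, 3, 5, 6]},
                   index=[2001, 2003, 2004, 2004])

# join defaults to how="left": every row of df1 is kept.
left_join = df1.join(df2)

# how="inner" keeps only the index labels found in both frames,
# so 2002 disappears and no NaN is introduced.
inner_join = df1.join(df2, how="inner")

# The same left join written as a merge on the indexes.
via_merge = pd.merge(df1, df2, left_index=True, right_index=True, how="left")

print(left_join)
print(inner_join)
print(via_merge)
```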

**Task 2: Make sure you can clearly differentiate merge and join in pandas.**

**Concatenation**

Concatenation basically glues the dataframes together. You can select the dimension on which you want to concatenate. For that, just use **pd.concat** and pass in the list of dataframes to concatenate together.

In [55]:
df1 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])

df2 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2005, 2006,2007,2008])

concat= pd.concat([df1,df2])
concat

      HPI  Int_Rate  IND_GDP
2001   80         2       50
2002   90         1       45
2003   70         2       45
2004   60         3       67
2005   80         2       50
2006   90         1       45
2007   70         2       45
2008   60         3       67


Notice that the two dataframes are glued together into a single dataframe, with the index running from 2001 all the way up to 2008.

You can also specify axis=1 to concatenate along the columns instead of the rows.

In [58]:
df1 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])

df2 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2005, 2006,2007,2008])

concat = pd.concat([df1,df2],axis=1)

concat

       HPI  Int_Rate  IND_GDP   HPI  Int_Rate  IND_GDP
2001  80.0       2.0     50.0   NaN       NaN      NaN
2002  90.0       1.0     45.0   NaN       NaN      NaN
2003  70.0       2.0     45.0   NaN       NaN      NaN
2004  60.0       3.0     67.0   NaN       NaN      NaN
2005   NaN       NaN      NaN  80.0       2.0     50.0
2006   NaN       NaN      NaN  90.0       1.0     45.0
2007   NaN       NaN      NaN  70.0       2.0     45.0
2008   NaN       NaN      NaN  60.0       3.0     67.0


Notice that there are a lot of missing values. This happens because neither dataframe has values for every index label in the result: df1 only covers 2001 to 2004 and df2 only covers 2005 to 2008. Therefore, make sure the indexes line up correctly when you join or concatenate along the columns.
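
As a minimal sketch of what "lining up" means, the example below concatenates along the columns three ways: with identical indexes, with partly overlapping indexes, and with `join="inner"` to keep only the shared labels. The frames `rates`, `gdp`, and `hpi` are made up for illustration.

```python
import pandas as pd

# Hypothetical single-column frames; rates and gdp share the same index.
rates = pd.DataFrame({"Int_Rate": [2, 1, 2, 3]},
                     index=[2001, 2002, 2003, 2004])
gdp = pd.DataFrame({"IND_GDP": [50, 45, 45, 67]},
                   index=[2001, 2002, 2003, 2004])
# hpi only covers the last two years.
hpi = pd.DataFrame({"HPI": [80, 90]}, index=[2003, 2004])

# Identical indexes: the columns sit side by side with no missing values.
aligned = pd.concat([rates, gdp], axis=1)

# Partial overlap: 2001 and 2002 get NaN in the HPI column.
partial = pd.concat([rates, hpi], axis=1)

# join="inner" keeps only the labels present in every frame.
overlap = pd.concat([rates, hpi], axis=1, join="inner")

print(aligned)
print(partial)
print(overlap)
```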

**Change the index**

Now let's see how to change the index values in a dataframe. For example, let's create a dataframe from a dictionary of key-value pairs and change its index.

In [62]:
import pandas as pd
df= pd.DataFrame({"Day":[1,2,3,4], "Visitors":[200, 100,230,300], "Bounce_Rate":[20,45,60,10]}) 

df.set_index("Day", inplace= True)
df

     Visitors  Bounce_Rate
Day
1         200           20
2         100           45
3         230           60
4         300           10


As you can see, the “Day” column has become the index of the data frame.
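
The example above changes the data frame in place. As a minimal sketch of an alternative, `set_index` can also return a new frame and leave the original untouched, and `reset_index` reverses the operation. The names `by_day` and `restored` are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4],
                   "Visitors": [200, 100, 230, 300],
                   "Bounce_Rate": [20, 45, 60, 10]})

# Without inplace=True, set_index returns a new frame; df keeps its "Day" column.
by_day = df.set_index("Day")

# Rows can now be looked up by their "Day" label.
print(by_day.loc[3])

# reset_index turns the index back into an ordinary column.
restored = by_day.reset_index()
print(restored)
```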

**Change the Column Headers**

Let's take the above example and change the column header from “Visitors” to “Users”.

In [65]:
import pandas as pd
df = pd.DataFrame({"Day":[1,2,3,4], "Visitors":[200, 100,230,300], "Bounce_Rate":[20,45,60,10]})

df = df.rename(columns={"Visitors":"Users"})
df

   Day  Users  Bounce_Rate
0    1    200           20
1    2    100           45
2    3    230           60
3    4    300           10


As the output above shows, the column header “Visitors” has been changed to “Users”.
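
`rename` is not limited to a single column. The sketch below renames two columns at once, applies a function to every header, and shows that a key missing from the frame is ignored by default. The new names "Bounce_Pct" and "Sessions" are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4],
                   "Visitors": [200, 100, 230, 300],
                   "Bounce_Rate": [20, 45, 60, 10]})

# Rename several columns in one call by passing a larger mapping.
renamed = df.rename(columns={"Visitors": "Users", "Bounce_Rate": "Bounce_Pct"})

# A function can be passed instead of a mapping; here every header is lowercased.
lowered = df.rename(columns=str.lower)

# "Sessions" is not a column of df, so this call changes nothing
# (the default is errors="ignore").
unchanged = df.rename(columns={"Sessions": "Users"})

print(renamed.columns.tolist())
print(lowered.columns.tolist())
print(unchanged.columns.tolist())
```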

**Data Munging**

Data munging means converting data from one format into another. For example, if you have a .csv file, you can convert it into .html or another data format.

In [72]:
import pandas as pd
country = pd.read_csv("data.csv",index_col=0)
country.to_html('index.html')

**Note**

Once you run the above code, an HTML file named “index.html” is created. You can copy the path of the file and paste it into your browser, which displays the data as an HTML table.
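
The code above needs a "data.csv" file on disk. As a minimal sketch that runs without any file, the example below reads CSV text from memory and converts it to HTML and JSON strings. The CSV content, including the column names "Country" and "Population", is made up for illustration and is not the content of the original "data.csv".

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "Country,Population\nA,10\nB,20\n"

# read_csv accepts any file-like object; index_col=0 uses the first column as the index.
country = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Called without a file name, to_html returns the markup as a string.
html = country.to_html()

# The same frame can be written to other formats, for example JSON.
as_json = country.to_json(orient="index")

print(html)
print(as_json)
```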