Pandas is used for data manipulation, analysis and cleaning. Python pandas is well suited for different kinds of data, such as:
  * Tabular data with heterogeneously-typed columns
  * Ordered and unordered time series data
  * Arbitrary matrix data with row and column labels
  * Unlabelled data
  * Any other form of observational or statistical data sets

> Using Python pandas, you can perform many operations on series and data frames, including handling missing data and grouping. Some of the common operations for data manipulation are listed below:
  * Slicing
  * Merging and Joining
  * Concatenation
  * Changing the index
  * Changing column headers
  * Data Munging

This tutorial discusses each of these operations using the Python pandas library.

In [1]:
import pandas as pd

**DataFrame and Series**

In order to master pandas you have to start from scratch with two main data structures: DataFrame and Series. If you don't understand them well you won't understand pandas.


**Series**

A Series is an object similar to Python's built-in list, but it differs in that each element has an associated label, called the index. This distinctive feature makes it behave like an associative array or dictionary (hash map).


In [3]:
my_series = pd.Series([5,6,7,8,9,10])
my_series

0     5
1     6
2     7
3     8
4     9
5    10
dtype: int64

In the output above, the index is on the left and the values are on the right. If the index is not provided explicitly, pandas creates a RangeIndex from 0 to N-1, where N is the total number of elements. Each Series object also has a data type (dtype); in our case it is int64.


Series has attributes to extract its values and index:

In [4]:
my_series.index

RangeIndex(start=0, stop=6, step=1)

In [5]:
my_series.values

array([ 5,  6,  7,  8,  9, 10])
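
The dictionary-like behaviour described above is easier to see with an explicit index. The following is a minimal sketch; the labels "a", "b" and "c" and the name labelled are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical labels chosen for illustration; any hashable values work.
labelled = pd.Series([5, 6, 7], index=["a", "b", "c"])

# Label-based lookup, much like a dictionary.
second = labelled["b"]

# Position-based lookup, much like a list.
first = labelled.iloc[0]

print(second, first)
```

Using .iloc for positional access keeps the two lookup styles unambiguous when the labels themselves are integers.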

**DataFrame**

A DataFrame is a table with rows and columns. Each column in a DataFrame is a Series, and each row holds one element from every column.

pd.DataFrame() converts a dictionary into a pandas DataFrame, using the dictionary keys as column names and showing the row index on the left.


A DataFrame can be constructed from a built-in Python dict:

In [11]:
df = pd.DataFrame({'Day':[1,2,3,4,5,6],'Visitors':[1000,700,6000,1000,400,350],'Bounce_Rate':[20,20,23,15,10,34],'Rating':[5,4,4,3,5,4]})
df

   Day  Visitors  Bounce_Rate  Rating
0    1      1000           20       5
1    2       700           20       4
2    3      6000           23       4
3    4      1000           15       3
4    5       400           10       5
5    6       350           34       4


A DataFrame object has two indexes: a column index and a row index. If you do not provide a row index explicitly, pandas creates a RangeIndex from 0 to N-1, where N is the number of rows in the DataFrame.
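
To make the two indexes concrete, here is a small sketch with an explicit row index. The frame name visits and the row labels "mon", "tue" and "wed" are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical row labels; the notebook's own frame uses the default RangeIndex.
visits = pd.DataFrame(
    {"Day": [1, 2, 3], "Visitors": [1000, 700, 6000]},
    index=["mon", "tue", "wed"],
)

# Selecting one column returns a Series that shares the row index.
visitors = visits["Visitors"]

print(type(visitors).__name__)
print(list(visits.columns))  # column index
print(list(visits.index))    # row index
```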

**Slicing the Data Frame**

To slice data, you first need a data frame. A data frame is a two-dimensional data structure and the most commonly used pandas object.

[pandas.DataFrame.head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head) and [pandas.DataFrame.tail()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) are two methods that slice the dataframe from the beginning and the end respectively. By default, head() returns the first five rows and tail() returns the last five.

In [12]:
print(df.head())

   Day  Visitors  Bounce_Rate  Rating
0    1      1000           20       5
1    2       700           20       4
2    3      6000           23       4
3    4      1000           15       3
4    5       400           10       5


In [13]:
print(df.head(2))

   Day  Visitors  Bounce_Rate  Rating
0    1      1000           20       5
1    2       700           20       4


In [14]:
print(df.tail(2))

   Day  Visitors  Bounce_Rate  Rating
4    5       400           10       5
5    6       350           34       4
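
head() and tail() only reach the ends of a frame. For rows in the middle, one option is the .iloc and .loc indexers. The sketch below assumes a cut-down copy of the visitor data under the hypothetical name traffic.

```python
import pandas as pd

traffic = pd.DataFrame({
    "Day": [1, 2, 3, 4, 5, 6],
    "Visitors": [1000, 700, 6000, 1000, 400, 350],
})

# Position-based slice: rows at positions 2 and 3 (the end position is excluded).
middle = traffic.iloc[2:4]

# Label-based slice: row labels 2 through 4 (the end label is included).
by_label = traffic.loc[2:4, "Visitors"]

print(middle)
print(by_label)
```

Note the difference in end points: .iloc follows Python's half-open slicing, while .loc includes the last label.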


**Merging and Joining**
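You can merge two data frames to form a single data frame. By default, the merge uses every column name the two frames share as the key and discards the original row indexes. You can also decide which columns you want to make common.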

[pandas.DataFrame.merge()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html#pandas.DataFrame.merge) method is used to merge two dataframes together.

In [15]:
df1= pd.DataFrame({ "HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])
df2=pd.DataFrame({ "HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2005, 2006,2007,2008])
merged= pd.merge(df1,df2) 
print(merged)

   HPI  Int_Rate  IND_GDP
0   80         2       50
1   90         1       45
2   70         2       45
3   60         3       67


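You can also specify the column which you want to make common. The other overlapping columns are then kept from both frames and distinguished by the suffixes _x and _y.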

In [16]:
merged= pd.merge(df1,df2,on ="HPI")
 
print(merged)

   HPI  Int_Rate_x  IND_GDP_x  Int_Rate_y  IND_GDP_y
0   80           2         50           2         50
1   90           1         45           1         45
2   70           2         45           2         45
3   60           3         67           3         67
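
Both merges above keep every row because the two frames hold identical HPI values. When the keys differ, the how parameter decides which rows survive. This is a minimal sketch with made-up frames named left and right; the suffixes "_left" and "_right" are likewise chosen only for illustration.

```python
import pandas as pd

left = pd.DataFrame({"HPI": [80, 90, 70], "Int_Rate": [2, 1, 2]})
right = pd.DataFrame({"HPI": [80, 90, 60], "Int_Rate": [2, 1, 3]})

# Default how="inner": only keys present in both frames (80 and 90).
inner = pd.merge(left, right, on="HPI")

# how="outer": keys from either frame; missing values become NaN.
outer = pd.merge(left, right, on="HPI", how="outer",
                 suffixes=("_left", "_right"))

print(inner)
print(outer)
```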


Joining in Python pandas combines two differently indexed dataframes into a single result dataframe. This is quite similar to the "merge" operation, except that the join is performed on the "index" instead of the "columns".

[pandas.DataFrame.join()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html?highlight=join#pandas.DataFrame.join) method is used to join two different dataframes by index.

In [17]:
df1 = pd.DataFrame({"Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])
 
df2 = pd.DataFrame({"Low_Tier_HPI":[50,45,67,34],"Unemployment":[1,3,5,6]}, index=[2001, 2003,2004,2004])
 
joined= df1.join(df2)
print(joined)


      Int_Rate  IND_GDP  Low_Tier_HPI  Unemployment
2001         2       50          50.0           1.0
2002         1       45           NaN           NaN
2003         2       45          45.0           3.0
2004         3       67          67.0           5.0
2004         3       67          34.0           6.0


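You can observe that joining is done on the index, not on the columns. Index labels with no match in the second dataframe, such as 2002, are filled with "NaN", and a label that appears twice in the second dataframe, such as 2004, produces two rows.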
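
If the unmatched rows are not wanted, join() also accepts a how argument. The sketch below uses two small hypothetical frames, rates and jobs, to compare the default left join with an inner join.

```python
import pandas as pd

rates = pd.DataFrame({"Int_Rate": [2, 1, 2]}, index=[2001, 2002, 2003])
jobs = pd.DataFrame({"Unemployment": [1, 3]}, index=[2001, 2003])

# Default how="left": keep every row of rates, fill gaps with NaN.
left_join = rates.join(jobs)

# how="inner": keep only index labels present in both frames.
inner_join = rates.join(jobs, how="inner")

print(left_join)
print(inner_join)
```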

**Concatenation**

Concatenation basically glues the dataframes together. You can select the dimension on which you want to concatenate.

[pandas.concat()](https://pandas.pydata.org/docs/reference/api/pandas.concat.html?highlight=concat#pandas.concat) method is used to concatenate two dataframes.

In [18]:
df1= pd.DataFrame({ "HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])
df2=pd.DataFrame({ "HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2005, 2006,2007,2008])
concat = pd.concat([df1,df2])
print(concat)

      HPI  Int_Rate  IND_GDP
2001   80         2       50
2002   90         1       45
2003   70         2       45
2004   60         3       67
2005   80         2       50
2006   90         1       45
2007   70         2       45
2008   60         3       67


You can also specify axis=1 to concatenate along the columns instead of the rows.

In [19]:
concat = pd.concat([df1,df2],axis=1)
print(concat)

       HPI  Int_Rate  IND_GDP   HPI  Int_Rate  IND_GDP
2001  80.0       2.0     50.0   NaN       NaN      NaN
2002  90.0       1.0     45.0   NaN       NaN      NaN
2003  70.0       2.0     45.0   NaN       NaN      NaN
2004  60.0       3.0     67.0   NaN       NaN      NaN
2005   NaN       NaN      NaN  80.0       2.0     50.0
2006   NaN       NaN      NaN  90.0       1.0     45.0
2007   NaN       NaN      NaN  70.0       2.0     45.0
2008   NaN       NaN      NaN  60.0       3.0     67.0


You can see there are a bunch of missing values. This happens because neither dataframe has values for all the index labels being concatenated on: df1 covers 2001 to 2004 and df2 covers 2005 to 2008.
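
For contrast, the following sketch shows a column-wise concatenation where the frames share an index, so no NaN values appear, plus a row-wise concatenation that renumbers the result. The frames a, b and c are hypothetical and much smaller than the ones above.

```python
import pandas as pd

a = pd.DataFrame({"HPI": [80, 90]}, index=[2001, 2002])
b = pd.DataFrame({"Int_Rate": [2, 1]}, index=[2001, 2002])
c = pd.DataFrame({"HPI": [70]}, index=[2003])

# Same index on both sides: columns line up and nothing is missing.
side_by_side = pd.concat([a, b], axis=1)

# Row-wise concatenation; ignore_index=True replaces the labels with 0..N-1.
stacked = pd.concat([a, c], ignore_index=True)

print(side_by_side)
print(stacked)
```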

**Aggregating and Grouping**

Grouping is probably one of the most popular operations in data analysis. To group data in pandas, use the .groupby method.

In [20]:
df

   Day  Visitors  Bounce_Rate  Rating
0    1      1000           20       5
1    2       700           20       4
2    3      6000           23       4
3    4      1000           15       3
4    5       400           10       5
5    6       350           34       4


Let's see how visitors are distributed across ratings. As stated above, we will use *.groupby*, here summing Visitors for each combination of Rating and Day.

In [29]:
print(df.groupby(['Rating','Day'])['Visitors'].sum())

Rating  Day
3       4      1000
4       2       700
        3      6000
        6       350
5       1      1000
        5       400
Name: Visitors, dtype: int64
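
Because every Day value is unique, grouping by both Rating and Day leaves one row per group, so the sums equal the original values. Grouping by Rating alone gives a real aggregation. This sketch rebuilds just the two relevant columns under the hypothetical name traffic.

```python
import pandas as pd

traffic = pd.DataFrame({
    "Visitors": [1000, 700, 6000, 1000, 400, 350],
    "Rating": [5, 4, 4, 3, 5, 4],
})

# One row per rating, with the total and the average number of visitors.
per_rating = traffic.groupby("Rating")["Visitors"].agg(["sum", "mean"])

print(per_rating)
```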


**Change the Index**

[pandas.DataFrame.set_index()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html?highlight=set_index#pandas.DataFrame.set_index) sets the DataFrame index using one or more existing columns or arrays (of the correct length). The new index can replace the existing index or expand on it.



In [None]:
df= pd.DataFrame({"Day":[1,2,3,4], "Visitors":[200, 100,230,300], "Bounce_Rate":[20,45,60,10]})  
df.set_index("Day", inplace= True) 
print(df)

     Visitors  Bounce_Rate
Day                       
1         200           20
2         100           45
3         230           60
4         300           10
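
The cell above modifies df in place. An alternative, sketched here with a hypothetical frame named visits, is to let set_index() return a new frame and to undo the change with reset_index().

```python
import pandas as pd

visits = pd.DataFrame({"Day": [1, 2], "Visitors": [200, 100]})

# Without inplace=True, set_index returns a new frame; visits is unchanged.
indexed = visits.set_index("Day")

# reset_index moves "Day" back into an ordinary column.
restored = indexed.reset_index()

print(indexed)
print(restored)
```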


**Change the Column Headers**

[pandas.DataFrame.rename()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename) method is used to change the column headers.

In [None]:
df = df.rename(columns={"Visitors":"Users"})
print(df)

     Users  Bounce_Rate
Day                    
1      200           20
2      100           45
3      230           60
4      300           10


You can see column header "Visitors" is changed to "Users".
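
rename() can also change several headers at once, or apply a function to every header. The sketch below assumes a one-row hypothetical frame named stats; the new names "Users" and "Bounce" are illustrative.

```python
import pandas as pd

stats = pd.DataFrame({"Visitors": [200], "Bounce_Rate": [20]})

# A dict maps each old header to its new name.
renamed = stats.rename(columns={"Visitors": "Users", "Bounce_Rate": "Bounce"})

# A callable is applied to every header; here, lower-casing them.
lowered = stats.rename(columns=str.lower)

print(list(renamed.columns))
print(list(lowered.columns))
```

In both cases rename() returns a new frame and leaves stats untouched.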

**Data Munging**

In data munging, you convert data from one format into another. For example, if you have a .csv file, you can convert it to .html or any other data format as well.

In [None]:
country=pd.read_csv("File_path/file_name.csv")
country.to_html('file_name.html')
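
The cell above needs a real file on disk. To try the same conversion without one, the CSV text can be supplied in memory. In this sketch the CSV content and its column names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "Country,Population\nA,10\nB,20\n"

country = pd.read_csv(io.StringIO(csv_text))

# With no path argument, to_html returns the markup as a string.
html = country.to_html(index=False)

print(html)
```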

These are some basic Python pandas operations and methods used in data analysis. You can check the pandas documentation [here](https://pandas.pydata.org/docs/reference/index.html#api).