<a href="https://colab.research.google.com/github/ArbabKhan-sudo/Nust_AI_Batch-1/blob/main/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Pandas**

---

pandas is a fast, powerful, and easy-to-use **open source** Python library for data analysis and data manipulation.


In [None]:
import pandas as pd

After the import command, we have access to a large number of pre-built classes and functions through the pd alias.

The two primary components of pandas are the Series and the DataFrame.

A Series is essentially a single column, and a DataFrame is a two-dimensional table made up of a collection of Series that share the same index.
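
The relationship between the two can be sketched with a small example. The variable names (apples, oranges, fruit) are chosen only for illustration, and the values match the fruit data used in this notebook.

```python
import pandas as pd

# A Series is a one-dimensional labeled array
apples = pd.Series([3, 2, 0, 1], name="apples")
oranges = pd.Series([0, 3, 7, 2], name="oranges")

# Placing Series side by side (axis=1) gives a DataFrame with one column per Series
fruit = pd.concat([apples, oranges], axis=1)

# Selecting a single column gives back a Series
column = fruit["apples"]

print(type(fruit).__name__)   # DataFrame
print(type(column).__name__)  # Series
print(fruit.shape)            # (4, 2)
```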


**Creating DataFrames from scratch**

---


There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

In [None]:
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}

Then pass it to the pd.DataFrame constructor:

In [None]:
purchases = pd.DataFrame(data)

purchases

Unnamed: 0,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2


Each (key, value) item in data corresponds to a column in the resulting DataFrame.

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

Let's have customer names as our index:

In [None]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

Unnamed: 0,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


So now we could locate a customer's order by using their name:

In [None]:
purchases.loc['June']  # return the row whose index label is 'June'

apples     3
oranges    0
Name: June, dtype: int64
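
loc also accepts a column name or a list of labels. The following is a minimal sketch that rebuilds the purchases DataFrame so it runs on its own; the names lily_oranges and two_customers are only illustrative.

```python
import pandas as pd

data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

# A row label plus a column name returns a single value
lily_oranges = purchases.loc['Lily', 'oranges']

# A list of row labels returns a smaller DataFrame
two_customers = purchases.loc[['June', 'David']]

print(lily_oranges)  # 7
print(two_customers)
```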

**Reading data from CSVs**


---


Let's walk through loading a comma-separated values (.csv) file into a DataFrame. The variable csv_path stores the path of the .csv file and is passed as the argument to the read_csv function.

The result is stored in the object df, a common short name for a variable that refers to a pandas DataFrame.

In [None]:
# Read data from a CSV file
# Example of a local Windows path: C:/Users/arbab/Downloads/titanic_data.csv

csv_path = '/content/titanic_data.csv'  # other formats have their own readers, e.g. pd.read_excel, pd.read_json
df = pd.read_csv(csv_path)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
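
read_csv also accepts a file-like object, which makes it easy to try out without the Titanic file. The sketch below is only an illustration: it parses a two-row CSV held in a string (rows copied from the output above, with fewer columns), and the names csv_text and sample are assumptions.

```python
import io
import pandas as pd

# A tiny CSV held in memory; quoted fields may contain commas
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22.0
3,1,3,"Heikkinen, Miss. Laina",female,26.0
"""

# io.StringIO wraps the string so read_csv can treat it like an open file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)            # (2, 6)
print(sample.loc[0, 'Name'])   # Braund, Mr. Owen Harris
```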


In [None]:
# Print first five rows of the dataframe

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q



We can use the method head() to examine the first five rows of a dataframe, and tail() to examine the last five.
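
Both methods take an optional row count. A minimal sketch on a made-up ten-row DataFrame (the name numbers is hypothetical):

```python
import pandas as pd

# Ten rows holding the values 0 through 9
numbers = pd.DataFrame({'n': range(10)})

first_three = numbers.head(3)   # first 3 rows
last_two = numbers.tail(2)      # last 2 rows
default_head = numbers.head()   # with no argument, 5 rows

print(list(first_three['n']))   # [0, 1, 2]
print(list(last_two['n']))      # [8, 9]
print(len(default_head))        # 5
```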

Check the number of rows and columns in df:

In [None]:
df.shape

(891, 12)

We can get summary statistics for the numeric columns using describe():

In [None]:
# Summary statistics for the numerical columns
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


To get summary statistics for the categorical (object) columns instead:

In [None]:
# Summary statistics for the categorical columns
df.describe(include="object")


Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


To check the data type of each column:

In [None]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

To count the null values in each column of the dataframe:

In [None]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
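
isnull() returns a table of True/False flags, so sum() counts the missing entries per column and mean() gives their share. A small sketch on made-up data (the people DataFrame is hypothetical, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
people = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
})

missing_counts = people.isnull().sum()   # True counts as 1, False as 0
missing_share = people.isnull().mean()   # fraction of rows that are missing

print(missing_counts['Age'])       # 2
print(missing_counts['Embarked'])  # 1
print(missing_share['Age'])        # 0.5
```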

In [None]:
# df.drop(columns=['Cabin'])  # e.g. drop a mostly empty column; returns a new DataFrame

# **Viewing Data and Accessing Data**

---


You can get a single column as a Series, which you can think of as a 1-D dataframe. Just use single brackets:

In [None]:
# Get the column as a series

x = df['Name']
x

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

Double brackets instead return the column 'Name' as a one-column dataframe, which we assign to x:

In [None]:
# Select the column Name as a dataframe

x = df[['Name']]
x

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,"Allen, Mr. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, Mr. Karl Howell"


In [None]:
# Confirm that double brackets return a dataframe

x = type(df[['Name']])
x

pandas.core.frame.DataFrame

You can do the same for multiple columns: put the dataframe name, in this case df, followed by the list of column headers enclosed in double brackets. The result is a new dataframe made up of the specified columns:

In [None]:
# Select multiple columns

y = df[['Name','Sex','Age']]
y

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0
...,...,...,...
886,"Montvila, Rev. Juozas",male,27.0
887,"Graham, Miss. Margaret Edith",female,19.0
888,"Johnston, Miss. Catherine Helen ""Carrie""",female,
889,"Behr, Mr. Karl Howell",male,26.0


One way to access individual elements is the iloc indexer, which selects by integer position. You can access the 1st row and the 1st column as follows:

In [None]:
# Access the value on the first row and the first column

df.iloc[0, 0] # iloc ("integer location") takes a row position and a column position

1

You can access the 2nd row and the 1st column as follows:

In [None]:
# Access the value on the second row and the first column

df.iloc[1,0]

2

You can also select by label with loc, using the row's index label and the column name:

In [None]:
# Access a value by row label and column name

df.loc[0, 'Name']

'Braund, Mr. Owen Harris'

You can also slice rows and columns. With iloc, slices use integer positions and exclude the end position:

In [None]:
# Slice the first two rows and the first three columns

df.iloc[0:2, 0:3]

Unnamed: 0,PassengerId,Survived,Pclass
0,1,0,3
1,2,1,1
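
Slicing by label with loc behaves slightly differently: the end label is included. A minimal sketch using the small purchases data from earlier, rebuilt here so the example runs on its own:

```python
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'],
)

# iloc: positions 0 and 1 only, the end position 2 is excluded
by_position = purchases.iloc[0:2, 0:1]

# loc: labels from 'June' through 'Lily', the end label is included
by_label = purchases.loc['June':'Lily', 'apples':'oranges']

print(by_position.shape)      # (2, 1)
print(list(by_label.index))   # ['June', 'Robert', 'Lily']
```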


# **Group By**

---

The pandas DataFrame.groupby() method splits the data into groups based on some criteria, so that a function such as mean() can be applied to each group.

In [None]:
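# Group by two columns and average the numeric columns within each group
# (numeric_only=True skips text columns such as Name, which newer pandas versions refuse to average)
df.groupby(['Survived','Pclass']).mean(numeric_only=True)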


Unnamed: 0_level_0,Unnamed: 1_level_0,PassengerId,Age,SibSp,Parch,Fare
Survived,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,1,410.3,43.695312,0.2875,0.3,64.684007
0,2,452.123711,33.544444,0.319588,0.14433,19.412328
0,3,453.580645,26.555556,0.672043,0.384409,13.669364
1,1,491.772059,35.368197,0.492647,0.389706,95.608029
1,2,439.08046,25.901566,0.494253,0.643678,22.0557
1,3,394.058824,20.646118,0.436975,0.420168,13.694887


In [None]:
# Mean Age for each (Survived, Pclass) group
df.groupby(['Survived','Pclass'])['Age'].mean()

Survived  Pclass
0         1         43.695312
          2         33.544444
          3         26.555556
1         1         35.368197
          2         25.901566
          3         20.646118
Name: Age, dtype: float64
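
groupby can also compute several statistics at once with agg. The sketch below uses a made-up five-row table (hypothetical values, not the Titanic data), and the output column names are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data
passengers = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Survived': [1, 0, 0, 1, 0],
    'Fare': [80.0, 50.0, 8.0, 7.0, 9.0],
})

# One statistic for one column
mean_fare = passengers.groupby('Pclass')['Fare'].mean()

# Several statistics at once: new_column=(source_column, function)
summary = passengers.groupby('Pclass').agg(
    passengers=('Survived', 'size'),
    survivors=('Survived', 'sum'),
    mean_fare=('Fare', 'mean'),
)

print(mean_fare.loc[1])  # 65.0
print(summary)
```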

# **Pivot**

---
Reshape data based on column values. pivot() uses the unique values from the specified index and columns to form the axes of the resulting DataFrame.


Syntax: df.pivot(index='field', columns='field', values='field')

In [None]:
df.pivot(index='PassengerId', columns='Sex', values='Survived')

Sex,female,male
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,,0.0
2,1.0,
3,1.0,
4,1.0,
5,,0.0
...,...,...
887,,0.0
888,1.0,
889,0.0,
890,,1.0
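
pivot only rearranges values; it does not aggregate, so each (index, columns) pair must appear at most once. A small sketch on made-up long-format data (the sales DataFrame is hypothetical):

```python
import pandas as pd

# Hypothetical long-format data: one row per (day, fruit) pair
sales = pd.DataFrame({
    'day': ['Mon', 'Mon', 'Tue', 'Tue'],
    'fruit': ['apples', 'oranges', 'apples', 'oranges'],
    'sold': [3, 0, 2, 3],
})

# One row per day, one column per fruit
wide = sales.pivot(index='day', columns='fruit', values='sold')
print(wide)

# Repeated (day, fruit) pairs cannot be reshaped and raise a ValueError
duplicate_error = None
try:
    pd.concat([sales, sales]).pivot(index='day', columns='fruit', values='sold')
except ValueError as err:
    duplicate_error = err
print(type(duplicate_error).__name__)  # ValueError
```

When pairs repeat, pivot_table is the usual alternative because it aggregates them.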


# **Pivot Table**

---

In [None]:
# One index column: number of survivors by Pclass, split by Sex
pd.pivot_table(df,
               values='Survived',
               index=['Pclass'],
               columns=['Sex'],
               aggfunc='sum'
              )

Sex,female,male
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,91,45
2,70,17
3,72,47


In [None]:
# Two index columns: number of survivors by Pclass and Embarked, split by Sex
pd.pivot_table(df,
               values='Survived',
               index=['Pclass','Embarked'],
               columns=['Sex'],
               aggfunc='sum'
              )

Unnamed: 0_level_0,Sex,female,male
Pclass,Embarked,Unnamed: 2_level_1,Unnamed: 3_level_1
1,C,42,17
1,Q,1,0
1,S,46,28
2,C,7,2
2,Q,2,0
2,S,61,15
3,C,15,10
3,Q,24,3
3,S,33,34
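
A pivot table is closely related to groupby: it groups by the index and columns fields, applies aggfunc, and spreads one of the grouping keys across the columns. The sketch below illustrates this on made-up data (hypothetical values, not the Titanic file).

```python
import pandas as pd

# Hypothetical toy data
passengers = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['female', 'male', 'female', 'male', 'female', 'male'],
    'Survived': [1, 0, 1, 0, 1, 0],
})

# Survivors per class and sex as a pivot table
table = pd.pivot_table(passengers,
                       values='Survived',
                       index='Pclass',
                       columns='Sex',
                       aggfunc='sum')

# The same numbers via groupby, with Sex moved into the columns by unstack()
same = passengers.groupby(['Pclass', 'Sex'])['Survived'].sum().unstack()

print(table)
print(same)
```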


# **Melt**

---



In [None]:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df


In [None]:
# Unpivot the DataFrame from wide format to long format:
# column A identifies each row, and column B is stacked into variable/value pairs

df.melt(id_vars=['A'], value_vars=['B'])


The names of ‘variable’ and ‘value’ columns can be customized:

In [None]:
df.melt(id_vars=['A'], value_vars=['B'],
        var_name='myVarname', value_name='myValname')

Unnamed: 0,A,myVarname,myValname
0,a,B,1
1,b,B,3
2,c,B,5
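
Melting more than one column stacks them underneath each other, and pivot can reverse the operation. A minimal sketch using the same three-row data (the names wide, long_form and back are only illustrative):

```python
import pandas as pd

wide = pd.DataFrame({'A': ['a', 'b', 'c'],
                     'B': [1, 3, 5],
                     'C': [2, 4, 6]})

# Stack columns B and C: 3 rows x 2 value columns -> 6 rows
long_form = wide.melt(id_vars=['A'], value_vars=['B', 'C'])
print(long_form)

# pivot turns the long format back into one column per variable
back = long_form.pivot(index='A', columns='variable', values='value')
print(back)
```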
