## Pandas
An open-source library for data manipulation and analysis.

Pandas
- Provides two core objects, the DataFrame and the Series
- Streamlined handling of tabular data and rich time series functionality
- Easy handling of missing data, data alignment, groupby, merge, and join operations

NumPy
- Provides the n-dimensional array object (ndarray)
- Supports fast mathematical computation on arrays and matrices
- Offers a wide range of mathematical array operations
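
As a minimal sketch of what "fast mathematical computation on arrays" means in practice, the snippet below applies arithmetic to whole arrays at once instead of looping over elements. The names `a`, `b`, and `m` are illustrative only.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Arithmetic is applied element by element, without an explicit loop.
print((a + b).tolist())   # [11, 22, 33]
print((a * 2).tolist())   # [2, 4, 6]

# The @ operator performs matrix multiplication on 2-D arrays.
m = np.array([[1, 2], [3, 4]])
print((m @ m).tolist())   # [[7, 10], [15, 22]]
```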


In [1]:
import pandas as pd
import numpy as np

In [3]:
l=[3,5,7,3,9,2,3]
l

[3, 5, 7, 3, 9, 2, 3]

<mark> Pandas Series </mark> - A Pandas Series is a one-dimensional array-like object that can hold data of any type (integer, string, float, Python objects, etc.). It is indexed by a set of labels, which can be strings, integers, or other hashable Python objects.

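A Pandas Series is similar to a one-dimensional NumPy array, with a few key differences. First, a Series carries an explicit index of labels, while a NumPy array is addressed by integer position only. Second, a Series is routinely used with non-numeric or mixed data; NumPy arrays can also hold strings or Python objects, but they are designed and optimized for homogeneous numeric data. Third, a Series has built-in methods for handling missing data (such as isna, dropna, and fillna), whereas NumPy offers only lower-level tools such as np.isnan and the nan-aware functions (np.nanmean and similar).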
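
The differences above can be sketched with a few throwaway values. The variable names (`arr`, `ser`, `words`, `gaps`) are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position only.
arr = np.array([3, 5, 7])
print(arr[1])                     # 5

# A Series carries labels, so a value can be looked up by label.
ser = pd.Series([3, 5, 7], index=["A", "B", "C"])
print(ser["B"])                   # 5

# NumPy arrays are not restricted to numbers; strings get a string dtype.
words = np.array(["x", "y"])
print(words.dtype.kind)           # U

# A Series has missing-data methods built in.
gaps = pd.Series([1.0, np.nan, 3.0])
print(gaps.isna().sum())          # 1
print(gaps.fillna(0.0).tolist())  # [1.0, 0.0, 3.0]
```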

Syntax: <br>
pd.Series(data, index=None, dtype=None, name=None) <br>
- data is a list, NumPy array, or dictionary of data values.
- index is a list of strings or integers that will be used to index the Series. If index is None, the default is a range from 0 to the length of the Series minus 1.
- dtype is the data type of the Series. If dtype is None, the default is inferred from the data.
- name is the name of the Series. If name is None, no name is assigned.
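
A short sketch of the parameters listed above, using invented data (`prices`, `counts`, `plain` are hypothetical names):

```python
import pandas as pd

# From a dictionary: the keys become the index labels.
prices = pd.Series({"apple": 1.5, "pear": 2.0}, name="price")
print(prices.index.tolist())   # ['apple', 'pear']
print(prices.name)             # price

# From a list, with an explicit index and dtype.
counts = pd.Series([3, 5, 7], index=["A", "B", "C"], dtype="float64")
print(counts.dtype)            # float64

# Without an index, the labels default to 0, 1, 2, ...
plain = pd.Series([3, 5, 7])
print(plain.index.tolist())    # [0, 1, 2]
```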

In [8]:
s=pd.Series(l)
s

0    3
1    5
2    7
3    3
4    9
5    2
6    3
dtype: int64

In [13]:
ind =['A','B','C','D','E','F','G']
s1=pd.Series(l,ind)

In [14]:
s1

A    3
B    5
C    7
D    3
E    9
F    2
G    3
dtype: int64

In [15]:
s1['D']

3

In [16]:
s1.index

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

In [17]:
s1.values

array([3, 5, 7, 3, 9, 2, 3], dtype=int64)

In [6]:
r=np.random.randint(1,50,(5,4))

In [7]:
r

array([[41, 23, 14, 29],
       [11, 33,  5, 12],
       [45, 12, 24, 22],
       [36, 35, 39, 34],
       [22, 23, 15, 45]])

<mark> DataFrame </mark> 
A Pandas DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table.
A DataFrame can be created from a variety of data sources, such as a list of lists, a dictionary, a NumPy array, or a database table.

Syntax : <br>
pd.DataFrame(data, index=None, columns=None)
- data is a list of lists, NumPy array, or dictionary of data values.
- index is a list of strings or integers that will be used to label the rows. If index is None, the default is a range from 0 to the number of rows minus 1.
- columns is a list of labels that will be used as the column names. If columns is None and the data carries no names (for example a list of lists or a NumPy array), the columns are numbered 0, 1, 2, and so on; for a dictionary, the keys become the column names.
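
The defaults described above can be checked with a small sketch. The array `values` and the frame names are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# No labels given: rows and columns are both numbered from 0.
default = pd.DataFrame(values)
print(default.columns.tolist())  # [0, 1]
print(default.index.tolist())    # [0, 1, 2]

# Explicit row and column labels.
labeled = pd.DataFrame(values, index=["x", "y", "z"], columns=["a", "b"])
print(labeled.loc["y", "b"])     # 4

# From a dictionary: the keys become the column names.
from_dict = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(from_dict.shape)           # (3, 2)
```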

In [8]:
c=pd.DataFrame(l)
c

,0
0,3
1,5
2,7
3,3
4,9
5,2
6,3


In [11]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df

,a,b
0,1,4
1,2,5
2,3,6


In [44]:
d=pd.DataFrame(r)
d

,0,1,2,3
0,41,23,14,29
1,11,33,5,12
2,45,12,24,22
3,36,35,39,34
4,22,23,15,45


In [30]:
row=["a","b","c","d","e"]
column=['a','b','c','d']
s2 = pd.DataFrame(r, index=row,columns=column)
s2

,a,b,c,d
a,41,23,14,29
b,11,33,5,12
c,45,12,24,22
d,36,35,39,34
e,22,23,15,45


Indexing and Slicing

In [18]:
# loc selects by label: row 'b', column 'c'
s2.loc['b', 'c']

5

In [19]:
s2.loc['b':'c']                    # loc slices by label and includes both the start and the end label

,a,b,c,d
b,11,33,5,12
c,45,12,24,22


In [20]:
s2.loc['b':'d','a':'c']

,a,b,c
b,11,33,5
c,45,12,24
d,36,35,39


In [23]:
s2.loc['c':'d','b':'d']

,b,c,d
c,12,24,22
d,35,39,34


In [24]:
# iloc selects by integer position: row 1, column 2
s2.iloc[1, 2]

5

In [28]:
s2.iloc[0:2]    # iloc slices by position, like ordinary Python slicing: the end position is excluded

,a,b,c,d
a,41,23,14,29
b,11,33,5,12
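
To make the loc/iloc contrast concrete, here is a minimal sketch on a small stand-in frame (`frame` is a hypothetical name; its values are copied from the first rows of the random array above).

```python
import pandas as pd

frame = pd.DataFrame(
    [[41, 23, 14], [11, 33, 5], [45, 12, 24]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# loc is label-based and the end label is included.
print(frame.loc["a":"b"].shape)    # (2, 3)

# iloc is position-based and the end position is excluded.
print(frame.iloc[0:1].shape)       # (1, 3)

# A single cell: pass the row and the column in one indexer.
print(frame.loc["b", "c"])         # 5
print(frame.iloc[1, 2])            # 5
```

Passing both row and column to one indexer avoids chained indexing such as `frame.loc['b']['c']`, which first builds an intermediate Series.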


In [31]:
s2.head(3)

,a,b,c,d
a,41,23,14,29
b,11,33,5,12
c,45,12,24,22


In [34]:
s2.drop('b', inplace=True)

In [35]:
s2

,a,b,c,d
a,41,23,14,29
c,45,12,24,22
d,36,35,39,34
e,22,23,15,45


In [45]:
print(d)
d.drop(1,axis=1, inplace=True)

    0   1   2   3
0  41  23  14  29
1  11  33   5  12
2  45  12  24  22
3  36  35  39  34
4  22  23  15  45


In [46]:
d

,0,2,3
0,41,14,29
1,11,5,12
2,45,24,22
3,36,39,34
4,22,15,45
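
The two drop calls above modify the frame in place. As a hedged alternative sketch, drop can also return a new frame and leave the original untouched; the frame and variable names below are invented for the example.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}, index=["x", "y"])

# Without inplace=True, drop returns a new frame.
no_b = frame.drop(columns="b")
print(no_b.columns.tolist())    # ['a', 'c']
print(frame.columns.tolist())   # ['a', 'b', 'c']

# index= removes rows instead of columns.
no_x = frame.drop(index="x")
print(no_x.index.tolist())      # ['y']
```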


Reading CSV

In [77]:
df=pd.read_csv('titanic.csv')
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
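
If titanic.csv is not at hand, read_csv can be tried on an in-memory string. This is only a sketch: `csv_text` is a hypothetical three-column excerpt built from two rows shown in this notebook, not the real file.

```python
import io
import pandas as pd

# Quoted fields may contain commas; an empty field is read as NaN.
csv_text = '''PassengerId,Name,Age
1,"Braund, Mr. Owen Harris",22
6,"Moran, Mr. James",
'''

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)                 # (2, 3)
print(sample["Age"].isna().sum())   # 1
print(sample.loc[0, "Name"])        # Braund, Mr. Owen Harris
```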


In [75]:
df.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [76]:
df.tail(5)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [78]:
df.shape

(891, 12)

In [82]:
df.columns    # column names

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [80]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [85]:
df['Sex'].unique()   # the distinct values themselves; use nunique() for how many there are

array(['male', 'female'], dtype=object)
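
A small sketch of how unique, nunique, and value_counts differ, on an invented Series named `sex`:

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male", "male"])

# unique() returns the distinct values in order of first appearance.
print(list(sex.unique()))            # ['male', 'female']

# nunique() returns how many distinct values there are.
print(sex.nunique())                 # 2

# value_counts() returns how often each value occurs, most frequent first.
print(sex.value_counts().to_dict())  # {'male': 3, 'female': 2}
```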

In [87]:
df['Pclass'].value_counts() #distribution of classes

Pclass
3    491
1    216
2    184
Name: count, dtype: int64

In [88]:
df['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [90]:
Male= df[df['Sex']=='male']
Male

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [91]:
df[df['Age']>30]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
873,874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
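
The two filters above can also be combined. The sketch below uses a small made-up frame (`people`) rather than the Titanic data. Each condition needs its own parentheses, and conditions are joined with `&` (and) or `|` (or).

```python
import pandas as pd

people = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 38.0, 54.0, None],
})

# Rows where both conditions hold.
older_men = people[(people["Sex"] == "male") & (people["Age"] > 30)]
print(older_men.index.tolist())   # [2]

# A comparison with a missing age is False, so row 3 is left out.
over_30 = people[people["Age"] > 30]
print(over_30.index.tolist())     # [1, 2]
```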


In [92]:
#Checking null values
df.isnull()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [93]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [94]:
df[['Cabin']]    # double brackets return a DataFrame; single brackets return a Series

,Cabin
0,
1,C85
2,
3,C123
4,
...,...
886,
887,B42
888,
889,C148


In [97]:
df[['Cabin']] = df[['Cabin']].dropna()   # has no effect: the assignment realigns on the index, so the dropped rows come back as NaN
df.isnull().sum()


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
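
The counts above show that Cabin still has 687 missing values after the assignment. A minimal sketch of why, on a toy frame with hypothetical values: assigning a shorter result back into the frame aligns it on the index, which restores the missing rows. Filtering the whole frame with `dropna(subset=...)` is one way to remove them.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Cabin": ["C85", np.nan, "C123"], "Fare": [71.3, 7.9, 53.1]})

# The dropna'd column has 2 rows, but assignment aligns on the index,
# so row 1 is filled with NaN again.
frame[["Cabin"]] = frame[["Cabin"]].dropna()
print(frame["Cabin"].isna().sum())   # 1

# To remove the rows, filter the whole frame instead.
with_cabin = frame.dropna(subset=["Cabin"])
print(len(with_cabin))               # 2
print(with_cabin.index.tolist())     # [0, 2]
```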

In [101]:
m=df['Age'].mean()

In [102]:
df['Age'] = df['Age'].fillna(m)   # replace missing ages with the mean age
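
The mean imputation used above can be checked on a tiny invented Series (`ages`). Note that mean() skips missing values by default.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0], name="Age")

# NaN is skipped when computing the mean: (20 + 40) / 2
mean_age = ages.mean()
print(mean_age)           # 30.0

# fillna returns a new Series with the gaps replaced.
filled = ages.fillna(mean_age)
print(filled.tolist())    # [20.0, 30.0, 40.0]
```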

Pivot Table

In [104]:
t1= df.pivot_table(index='Sex' ,values='Survived', aggfunc='sum')   # pass the name 'sum'; the builtin sum is deprecated here in recent pandas

In [105]:
t1

Sex,Survived
female,233
male,109


In [106]:
t2= df.pivot_table(index='Sex' ,values='Survived', aggfunc='mean')

In [107]:
t2

Sex,Survived
female,0.742038
male,0.188908


In [108]:
t3= df.pivot_table(index='Pclass' , values='Survived' , aggfunc='mean')

In [109]:
t3

Pclass,Survived
1,0.62963
2,0.472826
3,0.242363


In [114]:
t4=df.pivot_table(index=['Pclass','Sex'] , values='Survived' , aggfunc='mean')
t4

Pclass,Sex,Survived
1,female,0.968085
1,male,0.368852
2,female,0.921053
2,male,0.157407
3,female,0.5
3,male,0.135447
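
Because Survived is coded as 0 or 1, its mean per group is the survival rate of that group. As a sketch on made-up data (`toy`), the same numbers can be obtained with groupby, which may help when reading the pivot tables above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Survived": [1, 1, 0, 1],
})

# pivot_table returns a DataFrame indexed by Sex.
by_pivot = toy.pivot_table(index="Sex", values="Survived", aggfunc="mean")
print(by_pivot["Survived"].to_dict())   # {'female': 1.0, 'male': 0.5}

# groupby gives the same values as a Series.
by_group = toy.groupby("Sex")["Survived"].mean()
print(by_group.to_dict())               # {'female': 1.0, 'male': 0.5}
```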
