# Pandas 

![SegmentLocal](https://media2.giphy.com/media/EatwJZRUIv41G/giphy.gif "segment")

Pandas is a Python package for representing and analyzing tabular data efficiently.

It is built on top of the NumPy package (discussed previously!)

## Perks!! 

Good for cleaning data because it allows columns to be deleted from and inserted into DataFrames and higher-dimensional objects.

Data alignment can be explicit or automatic: the user can align data to a set of labels, or let Pandas align it automatically in computations.

Easy handling of missing data!

Intelligent and quick label-based indexing, and subsetting of large data sets
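
To make automatic alignment and missing-data handling concrete, here is a minimal sketch. The two Series, their labels, and their values are made up for illustration; only `pd.Series`, `+`, and `Series.add` with `fill_value` come from Pandas itself.

```python
import pandas as pd

# Two small Series whose labels only partly overlap (made-up values).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition lines up values by label, not by position.
# A label present in only one Series produces NaN (a missing value).
total = s1 + s2
print(total)

# fill_value=0 treats a missing label as 0 instead of producing NaN.
total_filled = s1.add(s2, fill_value=0)
print(total_filled)
```

Labels `b` and `c` exist in both Series, so only those two get a real sum in `total`.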

![SegmentLocal](https://media2.giphy.com/media/4527jA8ErzDAYkJ9cf/giphy.gif "segment")

In [3]:
### Steps : 1. Import the required libraries
#        2. Get the data from a source
#        3. Remove unused/irrelevant columns

In [6]:
#Import it:
import numpy as np
import pandas as pd

In [4]:
#Tip 1 : If you want to check your working directory, use !pwd
!pwd

/Users/richavala/Desktop/CodeYourDreams/Develop_Curriculum/7.) Packages


Pandas has two primary data structures: Series and DataFrame.
A Pandas DataFrame is almost like an Excel table, with rows and columns of data. However, each column has a single data type.

### Series

A Series is 1-dimensional and can be created from a list:

In [None]:
series = pd.Series([1, 2, 3])
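
A Series also carries an index of labels alongside its values. The sketch below shows the default integer index and a custom one; the variable names and the labels `x`, `y`, `z` are made up for illustration.

```python
import pandas as pd

# A Series built from a list gets a default integer index: 0, 1, 2.
numbers = pd.Series([1, 2, 3])

# Passing index= attaches custom labels (the labels here are made up).
labeled = pd.Series([1, 2, 3], index=["x", "y", "z"])

print(numbers[0])     # look up by the default integer label
print(labeled["y"])   # look up by the custom label
```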

### DataFrame 

DataFrames are 2-dimensional and can be created from a NumPy array, among other inputs:

Example 1:

In [None]:
dates = pd.date_range('20130101', periods=6)
# Creates ['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
#               '2013-01-05', '2013-01-06'],

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
#  Creates          A         B         C         D
#2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
#2013-01-02  1.212112 -0.173215  0.119209 -1.044236
#2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
#2013-01-04  0.721555 -0.706771 -1.039575  0.271860
#2013-01-05 -0.424972  0.567020  0.276232 -1.087401
#2013-01-06 -0.673690  0.113648 -1.478427  0.524988



credit : https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

Example 2:

In [None]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
# Creates:
#     A          B    C  D      E    F
#0  1.0 2013-01-02  1.0  3   test  foo
#1  1.0 2013-01-02  1.0  3  train  foo
#2  1.0 2013-01-02  1.0  3   test  foo
#3  1.0 2013-01-02  1.0  3  train  foo


credit : https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

### Viewing Data

To view the top rows of the data:

In [None]:
df.head()

To view the bottom rows of the data:

In [None]:
df.tail()

Both methods take an optional argument: the number of rows you want to view (5 by default)!
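
As a quick illustration of the row-count argument, here is a small sketch. The DataFrame `demo` is a made-up stand-in with ten rows, not the tutorial's data.

```python
import pandas as pd

# A tiny stand-in DataFrame with ten rows (values are made up).
demo = pd.DataFrame({"n": range(10)})

first_three = demo.head(3)   # rows 0, 1, 2
last_two = demo.tail(2)      # rows 8, 9
default_head = demo.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
```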

In [58]:
#For this tutorial, we will use the famous Titanic dataset, available on the Kaggle website, to learn about Pandas methods.
#After importing the required libraries,
#read the data file, which is in .csv (comma-separated values) format.

df = pd.read_csv("train.csv")


### Explore the dataframe

In [32]:
#To explore our DataFrame, use df.head() - returns the first 5 rows and df.tail() - returns the last 5 rows.

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [33]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


### Shape, dimensions, and summary of the dataframe: df.ndim, df.shape, df.info(), df.describe()


In [34]:
# 1 for 1-dimension (series) and 2 for 2-dimension (dataframe)

df.ndim

2

In [35]:
#returns tuple of shape (Rows, columns) of data frame
#tuple is a Python object

df.shape

(891, 12)

In [36]:
#df.info() prints a concise summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [37]:
#df.describe() returns descriptive statistics that summarize the central tendency, dispersion, and shape
#of the data distribution for the numeric columns.

df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [38]:
#Another way to find out the column names of the dataframe

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [40]:
#To get the count of each unique value in the selected column
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [65]:
#.isnull() returns True for each null value and False otherwise in the selected column
df['Embarked'].isnull().head()

0    False
1    False
2    False
3    False
4    False
Name: Embarked, dtype: bool

In [42]:
#.isnull() and .sum() method will list counts of the null values in the selected column
df['Embarked'].isnull().sum()

2

In [43]:
#.notnull() and .sum() method will list counts of the non-null values in the selected column
df['Embarked'].notnull().sum()

889

In [44]:
# returns the unique values in the selected column
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)
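
The same `.isnull().sum()` pattern also works on a whole DataFrame, giving one count per column. This is a minimal sketch on a made-up stand-in (`sample`) that borrows a few Titanic-style column names; the values are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Stand-in for a few Titanic-style columns (values are made up).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})

# isnull() on the whole DataFrame gives a True/False table;
# sum() then counts the True values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```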

### Selecting Data

In [64]:
#selecting only one column
df['Fare'].head()

0     7.2500
1    71.2833
2     7.9250
3    53.1000
4     8.0500
Name: Fare, dtype: float64

In [66]:
#selecting multiple columns
df[['Pclass','Name','Sex']].head()

Unnamed: 0,Pclass,Name,Sex
0,3,"Braund, Mr. Owen Harris",male
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
2,3,"Heikkinen, Miss. Laina",female
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
4,3,"Allen, Mr. William Henry",male


In [67]:
#To select specific subsets of data, we use the .loc and .iloc indexers

#.loc selects rows and columns by label-based indexing.
#The following will select all rows and the columns from Survived through Name (both included).

df.loc[:,'Survived' : 'Name'].head()

Unnamed: 0,Survived,Pclass,Name
0,0,3,"Braund, Mr. Owen Harris"
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,1,3,"Heikkinen, Miss. Laina"
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,0,3,"Allen, Mr. William Henry"


In [68]:
#iloc will select rows and columns by integer-based indexing
#the following code selects all rows and the columns at positions 0 to 3; position 4 is excluded
df.iloc[:,0:4].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name
0,1,0,3,"Braund, Mr. Owen Harris"
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,3,1,3,"Heikkinen, Miss. Laina"
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,5,0,3,"Allen, Mr. William Henry"
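
One difference is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position. The sketch below shows this side by side on a made-up stand-in DataFrame (`sample`) that only borrows the Titanic column names.

```python
import pandas as pd

# Stand-in DataFrame with Titanic-style column names (values are made up).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
})

# .loc slices by label and INCLUDES the end label ("Name").
by_label = sample.loc[:, "Survived":"Name"]

# .iloc slices by position and EXCLUDES the end position (3).
by_position = sample.iloc[:, 1:3]

print(list(by_label.columns))
print(list(by_position.columns))
```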


### Filtering based on some condition


In [None]:
#the following code will select rows where the Sex value equals "male". The expression df['Sex'] == 'male'
#returns a Boolean value of True/False for each row

df[df['Sex'] == 'male'].head()


In [72]:
#To filter the other way around (every row where Sex is not 'male'), use !=
df[df['Sex'] != 'male'].head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [78]:
#Filtering under two or more conditions
#AND operator
df_fare = df["Fare"] < 100
df_sex = df["Sex"] == "female"
df[df_fare & df_sex]


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
18,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,1,0,345763,18.0000,,S


In [79]:
#Filtering under two or more conditions
# OR operator
df_fare = df["Fare"] < 100
df_sex = df["Sex"] == "female"
df[df_fare | df_sex]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
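
The conditions can also be written inline instead of being stored in variables first. This is a sketch on a made-up stand-in DataFrame (`sample`); it also shows `~`, which negates a Boolean mask.

```python
import pandas as pd

# Stand-in data with two Titanic-style columns (values are made up).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 150.0, 200.0],
})

# Each condition needs its own parentheses, because & and |
# bind more tightly than comparison operators such as < and ==.
cheap_female = sample[(sample["Fare"] < 100) & (sample["Sex"] == "female")]

# ~ negates a Boolean mask: here, everyone who is NOT male.
not_male = sample[~(sample["Sex"] == "male")]

print(cheap_female)
print(not_male)
```

Note that Python's `and` / `or` keywords do not work element by element on a Series, which is why `&` and `|` are used.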


### Sorting the DataFrame on a column


In [70]:
#the method will bring the result in an ascending order
df.sort_values("Fare").head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S
597,598,0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S
302,303,0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0,,S
277,278,0,2,"Parkes, Mr. Francis ""Frank""",male,,0,0,239853,0.0,,S
413,414,0,2,"Cunningham, Mr. Alfred Fleming",male,,0,0,239853,0.0,,S
674,675,0,2,"Watson, Mr. Ennis Hastings",male,,0,0,239856,0.0,,S
263,264,0,1,"Harrison, Mr. William",male,40.0,0,0,112059,0.0,B94,S
466,467,0,2,"Campbell, Mr. William",male,,0,0,239853,0.0,,S
732,733,0,2,"Knight, Mr. Robert J",male,,0,0,239855,0.0,,S


In [71]:
#ascending = False will bring the result in a descending order
df.sort_values("Fare", ascending = False).head(10)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0,C23 C25 C27,S
438,439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0,C23 C25 C27,S
311,312,1,1,"Ryerson, Miss. Emily Borie",female,18.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C
742,743,1,1,"Ryerson, Miss. Susan Parker ""Suzette""",female,21.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C
118,119,0,1,"Baxter, Mr. Quigg Edmond",male,24.0,0,1,PC 17558,247.5208,B58 B60,C
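
`sort_values` also accepts a list of columns, with one sort direction per column. The sketch below uses a made-up stand-in DataFrame (`sample`) with two Titanic-style columns.

```python
import pandas as pd

# Stand-in data (values are made up).
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# Sort by Pclass ascending, then by Fare descending within each class.
ordered = sample.sort_values(["Pclass", "Fare"], ascending=[True, False])
print(ordered)
```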


### Cleaning Data

In [61]:
#Replacing the data
#Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
#the following command replaces every cell whose value is exactly "C" with "Cherbourg" and stores the result in a new dataframe; df itself is unchanged
df1 = df.replace({"C" : "Cherbourg"})
df1.head()




Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,Cherbourg
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [63]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
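
Calling `replace` on the whole dataframe looks at every column. To limit the replacement to one column, one option is to call it on that column only. This is a sketch on a made-up stand-in DataFrame (`sample`); the `ports` dictionary is a hypothetical helper built from the port codes listed above.

```python
import pandas as pd

# Stand-in data: the value "C" appears in two different columns (made up).
sample = pd.DataFrame({
    "Cabin": ["C", "C85", "E46"],
    "Embarked": ["C", "S", "Q"],
})

# Replacing on a single column leaves every other column untouched.
ports = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
sample["Embarked"] = sample["Embarked"].replace(ports)

print(sample)
```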


In [None]:
### Finding Null Values with .isnull()


In [80]:
df["Cabin"].isnull()


0       True
1      False
2       True
3      False
4       True
5       True
6      False
7       True
8       True
9       True
10     False
11     False
12      True
13      True
14      True
15      True
16      True
17      True
18      True
19      True
20      True
21     False
22      True
23     False
24      True
25      True
26      True
27     False
28      True
29      True
       ...  
861     True
862    False
863     True
864     True
865     True
866     True
867    False
868     True
869     True
870     True
871    False
872    False
873     True
874     True
875     True
876     True
877     True
878     True
879    False
880     True
881     True
882     True
883     True
884     True
885     True
886     True
887    False
888     True
889    False
890     True
Name: Cabin, Length: 891, dtype: bool

In [81]:
### Dealing with missing values


In [82]:
#Use .drop() method to drop a column
# axis = 1 drops a column whereas axis = 0 drops a row
df.drop(labels = ["Cabin"], axis=1).head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S
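
Instead of dropping a whole column, rows that contain missing values can be dropped with `.dropna()`. The sketch below uses a made-up stand-in DataFrame (`sample`), not the Titanic file.

```python
import numpy as np
import pandas as pd

# Stand-in data with a few missing values (values are made up).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", "C", np.nan],
    "Fare": [7.25, 71.28, 7.93],
})

# Drop every row that has at least one missing value.
complete_rows = sample.dropna()

# Drop a row only when a specific column is missing.
has_age = sample.dropna(subset=["Age"])

print(complete_rows)
print(has_age)
```

Like `.drop()`, `.dropna()` returns a new DataFrame and leaves the original unchanged.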


In [83]:
### Filling missing values

In [87]:
df['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [90]:
### Fill in missing values with .fillna()
#Note: filling a numeric column with a string changes its dtype from float64 to object

df["Age"] = df["Age"].fillna("Unknown")

df['Age']

0           22
1           38
2           26
3           35
4           35
5      Unknown
6           54
7            2
8           27
9           14
10           4
11          58
12          20
13          39
14          14
15          55
16           2
17     Unknown
18          31
19     Unknown
20          35
21          34
22          15
23          28
24           8
25          38
26     Unknown
27          19
28     Unknown
29     Unknown
        ...   
861         21
862         48
863    Unknown
864         24
865         42
866         27
867         31
868    Unknown
869          4
870         26
871         47
872         33
873         47
874         28
875         15
876         20
877         19
878    Unknown
879         56
880         25
881         33
882         22
883         28
884         25
885         39
886         27
887         19
888    Unknown
889         26
890         32
Name: Age, Length: 891, dtype: object
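
Filling with a string turns the column into `object` dtype, so numeric methods such as `.mean()` no longer work on it directly. A common alternative, shown here only as a sketch, is to fill with a number such as the median. The Series `ages` is a made-up stand-in, not the real Age column.

```python
import numpy as np
import pandas as pd

# Stand-in for an Age column with gaps (values are made up).
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan], name="Age")

# The median skips missing values by default.
median_age = ages.median()   # median of 22, 26, 35

# fillna returns a new Series; the result stays numeric (float64).
ages_filled = ages.fillna(median_age)

print(ages_filled)
```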

In [91]:
### Combining Multiple DataFrames


In [92]:
df_A = pd.DataFrame([['Tom', 10], ['Nick', 15], ['juli', 14]])
df_B = pd.DataFrame([['Bob', 25], ['Becky', 20], ['Lee', 24]])
print(df_A)
print(df_B)

      0   1
0   Tom  10
1  Nick  15
2  juli  14
       0   1
0    Bob  25
1  Becky  20
2    Lee  24


In [None]:
#Concatenate data by column (side by side)
pd.concat([df_A, df_B], axis=1)
#Concatenate data by row (stacked on top of each other)
pd.concat([df_A, df_B], axis=0)
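
When stacking by row, each dataframe keeps its own index, so labels can repeat. The sketch below shows this and one way to renumber the rows with `ignore_index=True`. The frames `df_top` and `df_bottom` and their column names are made up for illustration.

```python
import pandas as pd

# Two small stand-in frames with the same columns (values are made up).
df_top = pd.DataFrame([["Tom", 10], ["Nick", 15]], columns=["name", "age"])
df_bottom = pd.DataFrame([["Bob", 25], ["Lee", 24]], columns=["name", "age"])

# Stacking rows keeps each frame's original index: 0, 1, 0, 1.
stacked = pd.concat([df_top, df_bottom], axis=0)

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([df_top, df_bottom], axis=0, ignore_index=True)

print(stacked)
print(renumbered)
```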