## Data Analytics - Crash Course for Beginners

## Data Manipulation Using Pandas Library

## Learning Objectives
```
Introduction to Pandas
Installation of Pandas
Pandas Objects
Pandas Sort
Working with Text Data
Statistical Functions
Indexing and Selecting Data
```

## Introduction to Pandas
```
Pandas is an open-source Python library that uses powerful data structures to provide high-performance data manipulation and analysis.
It provides a variety of data structures and operations for manipulating numerical data and time series.
This library is based on the NumPy library.
```

## Introducing Pandas Objects
```
Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
There are three fundamental Pandas data structures:
Series
DataFrame
Index
```

## What is a Series?
```
A Pandas Series is a labelled one-dimensional array that can hold any type of data (integer, string, float, Python objects, and so on).
A Series is comparable to a single column in an Excel spreadsheet.
Using the Series() constructor, we can easily convert a list, tuple, or dictionary into a Series.
```

## Creating a Series

In [4]:
import numpy as np
import pandas as pd

In [5]:
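# Create an empty Series; an explicit dtype avoids the
# default-dtype warning raised by some pandas versions
ser = pd.Series(dtype="float64")
print(ser)
# Create a Series from a NumPy array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)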

Series([], dtype: float64)
0    g
1    e
2    e
3    k
4    s
dtype: object




## Creating a series from Lists:

In [6]:
import pandas as pd
# a simple list (named letters to avoid shadowing the built-in list type)
letters = ['g', 'e', 'e', 'k', 's']
# create a Series from the list
ser = pd.Series(letters)
print(ser)

0    g
1    e
2    e
3    k
4    s
dtype: object
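
The same constructor also accepts a dictionary or a tuple, as mentioned earlier. The following is a minimal sketch; the variable names, marks, and index labels are made up for illustration.

```python
import pandas as pd

# From a dictionary: the keys become the index labels
marks = pd.Series({"maths": 90, "science": 85, "english": 78})
print(marks)

# From a tuple, with explicit (hypothetical) index labels
letters = pd.Series(("g", "e", "e", "k", "s"), index=["a", "b", "c", "d", "e"])
print(letters)

# Values can then be looked up by label
print(marks["science"])
```

Passing index lets you choose the labels yourself instead of relying on the default 0, 1, 2, ... positions.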


## Pandas Index
```
Pandas Index is an efficient tool for extracting particular rows and columns of data from a DataFrame.
Its job is to organise data and make it easily accessible.
We can also define an index, similar to an address, through which we can access any data in the Series or DataFrame.  
```

## Creating index
First, we load a CSV file containing the data that we want to index.

In [2]:
# importing pandas package 
import pandas as pd 
data = pd.read_csv("airlines.csv")
data

Unnamed: 0,Airport.Code,Airport.Name,Time.Label,Time.Month,Time.Month Name,Time.Year,Statistics.# of Delays.Carrier,Statistics.# of Delays.Late Aircraft,Statistics.# of Delays.National Aviation System,Statistics.# of Delays.Security,...,Statistics.Flights.Delayed,Statistics.Flights.Diverted,Statistics.Flights.On Time,Statistics.Flights.Total,Statistics.Minutes Delayed.Carrier,Statistics.Minutes Delayed.Late Aircraft,Statistics.Minutes Delayed.National Aviation System,Statistics.Minutes Delayed.Security,Statistics.Minutes Delayed.Total,Statistics.Minutes Delayed.Weather
0,ATL,"Atlanta, GA: Hartsfield-Jackson Atlanta Intern...",2003/06,6,June,2003,1009,1275,3217,17,...,5843,27,23974,30060,61606,68335,118831,518,268764,19474
1,BOS,"Boston, MA: Logan International",2003/06,6,June,2003,374,495,685,3,...,1623,3,7875,9639,20319,28189,24400,99,77167,4160
2,BWI,"Baltimore, MD: Baltimore/Washington Internatio...",2003/06,6,June,2003,296,477,389,8,...,1245,15,6998,8287,13635,26810,17556,278,64480,6201
3,CLT,"Charlotte, NC: Charlotte Douglas International",2003/06,6,June,2003,300,472,735,2,...,1562,14,7021,8670,14763,23379,23804,127,65865,3792
4,DCA,"Washington, DC: Ronald Reagan Washington National",2003/06,6,June,2003,283,268,487,4,...,1100,18,5321,6513,13775,13712,20999,120,52747,4141
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4403,SAN,"San Diego, CA: San Diego International",2016/01,1,January,2016,280,397,171,2,...,871,18,5030,6016,15482,23023,6021,79,46206,1601
4404,SEA,"Seattle, WA: Seattle/Tacoma International",2016/01,1,January,2016,357,513,351,2,...,1274,31,8330,9739,25461,32693,11466,73,74017,4324
4405,SFO,"San Francisco, CA: San Francisco International",2016/01,1,January,2016,560,947,2194,2,...,3825,20,8912,13206,43641,72557,153416,66,278610,8930
4406,SLC,"Salt Lake City, UT: Salt Lake City International",2016/01,1,January,2016,338,540,253,3,...,1175,14,7426,8699,32066,33682,8057,57,76978,3116
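
Once the data is loaded, one possible way to create a meaningful index is set_index(), which turns a column into the row labels. The sketch below assumes a small hand-built frame (named flights here) that copies three rows and three columns from the output above, so that it runs without the CSV file.

```python
import pandas as pd

# Small sample copied from the first rows of the airlines output above
flights = pd.DataFrame({
    "Airport.Code": ["ATL", "BOS", "BWI"],
    "Time.Year": [2003, 2003, 2003],
    "Statistics.Flights.Total": [30060, 9639, 8287],
})

# Use the airport code as the row index instead of 0, 1, 2, ...
indexed = flights.set_index("Airport.Code")
print(indexed.index)

# Rows can now be looked up by airport code
print(indexed.loc["BOS", "Statistics.Flights.Total"])
```

With the real file, the same call would be applied to the DataFrame returned by read_csv().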


## Pandas DataFrame

```
A DataFrame is a two-dimensional data structure with labelled rows and columns. It is similar to a spreadsheet in Excel or Calc, or to a SQL table. A Pandas DataFrame consists of three main components: the data, the index, and the columns.
```

## Creating a Pandas DataFrame
Creating a DataFrame from a list: a DataFrame can be created from a single list or from a list of lists.

In [3]:
import pandas as pd
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
# Calling DataFrame constructor on list 
df = pd.DataFrame(lst)
print(df)


        0
0   Geeks
1     For
2   Geeks
3      is
4  portal
5     for
6   Geeks
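
The list-of-lists case works the same way: each inner list becomes one row. This is a small sketch with made-up rows and column names.

```python
import pandas as pd

# Each inner list is one row; columns names the fields
rows = [["Tom", 20], ["nick", 21], ["krish", 19]]
df = pd.DataFrame(rows, columns=["Name", "Age"])
print(df)
```

Without the columns argument, the columns would be labelled 0 and 1, just as the single-list example produces a column labelled 0.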


## Creating DataFrame from dict of ndarray/lists
To generate a DataFrame from a dict of ndarrays/lists, every ndarray or list must have the same length.

In [4]:
# Create a DataFrame from a dict of lists.
# No index is passed, so the default integer index 0, 1, 2, ... is used.
import pandas as pd
# initialise the data as a dict of lists
data = { 'Name': ['Tom', 'nick', 'krish', 'jack'], 
        'Age': [20, 21, 19, 18]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18


## Reindexing
```
Reindexing changes the row and column labels of a DataFrame.
It conforms the data to a given set of labels along a particular axis. Reindexing lets us carry out several operations, including:
Reordering existing data to match a new set of labels.
Inserting missing value (NaN) markers at label locations where no data existed for the label.
To reindex a DataFrame, use the reindex() function.
Labels in the new index that have no matching record in the DataFrame are given the value NaN by default.
```

In [12]:
import pandas as pd
# Create dataframe
info = pd.DataFrame({"P":[4, 7, 1, 8, 9], 
                     "Q":[6, 8, 10, 15, 11], 
                     "R":[17, 13, 12, 16, 14], 
                     "S":[15, 19, 7, 21, 9]}, 
                    index =["Parker", "William", "Smith", "Terry", "Phill"])
#Print dataframe
info

Unnamed: 0,P,Q,R,S
Parker,4,6,17,15
William,7,8,13,19
Smith,1,10,12,7
Terry,8,15,16,21
Phill,9,11,14,9


## Now, we can use the dataframe.reindex() function to reindex the dataframe.

In [6]:
# reindexing with new index values 
info.reindex(["A", "B", "C", "D", "E"])

Unnamed: 0,P,Q,R,S
A,,,,
B,,,,
C,,,,
D,,,,
E,,,,


```
Notice that the new indexes are populated with NaN values.
We can fill in the missing values using the fill_value parameter.
```

In [8]:
# fill the missing values with 100
info.reindex(["A", "B", "C", "D", "E"], fill_value=100)

Unnamed: 0,P,Q,R,S
A,100,100,100,100
B,100,100,100,100
C,100,100,100,100
D,100,100,100,100
E,100,100,100,100
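
In the examples above none of the new labels exist in the original index, so every row is empty. A sketch with partially overlapping labels shows the reordering behaviour more clearly; the label "Jones" and column "Z" are hypothetical additions.

```python
import pandas as pd

info = pd.DataFrame({"P": [4, 7, 1], "Q": [6, 8, 10]},
                    index=["Parker", "William", "Smith"])

# "Smith" and "Parker" exist and keep their data; "Jones" is new and gets NaN
reordered = info.reindex(["Smith", "Jones", "Parker"])
print(reordered)

# Columns can be reindexed too; the new column "Z" is filled with 0
cols = info.reindex(columns=["Q", "P", "Z"], fill_value=0)
print(cols)
```

Existing labels carry their values to the new position, and only labels that were absent receive NaN or the fill value.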


## Pandas Sort
```
There are two kinds of sorting available in Pandas:
By label
By actual value
By label: the sort_index() method sorts a DataFrame by its labels; the axis and ascending arguments control which labels are sorted and in what order. By default, row labels are sorted in ascending order.
```

In [9]:
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10, 2), index=[1, 4, 6, 2, 3, 5, 9, 8, 8, 7], columns=['col2', 'col1'])
sorted_df = unsorted_df.sort_index()
print(sorted_df)

       col2      col1
1 -1.077208 -0.278496
2  1.355789  0.595029
3 -0.624221  0.541755
4 -1.311704  0.135944
5 -0.561458 -0.738655
6  0.531195  0.580819
7  0.156011 -2.201475
8  1.596997 -1.126261
8  0.182262  1.062170
9  0.906981 -1.241922


## Order of Sorting
The order of sorting can be controlled by passing a Boolean value to the ascending parameter. To better understand this, consider the following example.

In [14]:
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10, 2), index=[1, 4, 6, 2, 3, 5, 9, 8, 8, 7], columns=['col2', 'col1'])
sorted_df = unsorted_df.sort_index(ascending=True)   # pass ascending=False to sort in descending order
print(sorted_df)

       col2      col1
1 -2.446904  0.187454
2 -0.120554 -0.123544
3  0.521741  0.666213
4  2.140242 -1.753672
5 -0.069389  0.858318
6  0.188284  0.954180
7  0.078717 -1.092112
8 -0.536986 -1.088562
8  0.816096  1.378297
9 -0.854985  0.395582
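
For comparison, here is a short sketch of the descending case. It uses a small fixed DataFrame (values chosen for illustration) rather than random numbers so the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30, 40]}, index=[3, 1, 4, 2])

# ascending=False sorts the row labels from largest to smallest
descending = df.sort_index(ascending=False)
print(descending)
```

The rows move together with their labels, so the value 30 (label 4) ends up first.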


## Sort the Columns
Passing axis=1 to sort_index() sorts the column labels instead of the row labels. The default, axis=0, sorts by row label. To better understand this, consider the following examples.

In [18]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10, 2), index=[1,4,6,2,3,5,9,8,8,7], columns = ['col2', 'col1'])

sorted_df=unsorted_df.sort_index(axis=1)

print (sorted_df)

       col1      col2
1 -0.097883  0.105165
4 -0.020545 -0.854819
6 -1.037343  0.250394
2 -1.436486 -0.017648
3 -0.334364  1.895937
5 -2.265965  2.013738
9  0.532733  0.980681
8  1.385529 -0.210616
8  0.944115 -0.635853
7 -0.741651  0.806541


In [22]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10, 2), index=[1,4,6,2,3,5,9,8,8,7], columns = ['col2', 'col1'])
print(unsorted_df)
sorted_df=unsorted_df.sort_index(axis=0)

print (sorted_df)

       col2      col1
1  0.639054  0.538377
4  0.641302 -0.486712
6 -0.646616 -0.134029
2 -1.276167 -1.362598
3  0.043373  0.535828
5 -0.807689 -0.684248
9 -0.394312 -0.038692
8  0.734735  1.317872
8  0.838763 -0.354872
7  0.584525  0.867051
       col2      col1
1  0.639054  0.538377
2 -1.276167 -1.362598
3  0.043373  0.535828
4  0.641302 -0.486712
5 -0.807689 -0.684248
6 -0.646616 -0.134029
7  0.584525  0.867051
8  0.734735  1.317872
8  0.838763 -0.354872
9 -0.394312 -0.038692


## By Value
The sort_values() method sorts by the values in the data rather than by the labels. It accepts a 'by' argument, which names the DataFrame column (or columns) whose values determine the order.

In [21]:
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1': [2,1,1,1], 'col2': [1,3,2,4]}) 
print(unsorted_df)

sorted_df = unsorted_df.sort_values (by='col1')
print (sorted_df)

   col1  col2
0     2     1
1     1     3
2     1     2
3     1     4
   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1
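
The 'by' argument can also take a list of columns, and the ascending parameter works as it does for sort_index(). A minimal sketch reusing the same data:

```python
import pandas as pd

unsorted_df = pd.DataFrame({"col1": [2, 1, 1, 1], "col2": [1, 3, 2, 4]})

# Sort by col1 first; rows with equal col1 are then ordered by col2
by_both = unsorted_df.sort_values(by=["col1", "col2"])
print(by_both)

# Sort by col2 from largest to smallest
by_col2_desc = unsorted_df.sort_values(by="col2", ascending=False)
print(by_col2_desc)
```

Sorting on a second column is a simple way to make the order of tied rows explicit.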


## Working with Text Data
```
Pandas includes a set of string functions that make working with text data simple.
Most importantly, these functions skip missing/NaN values and leave them as NaN in the result.
The examples below show how some of these operations behave.
```
![download.png](attachment:download.png)

In [23]:
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
print(s.str.lower())

0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object


In [24]:
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
print(s.str.upper())

0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object
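
Other methods on the str accessor follow the same pattern. As a sketch on the same Series, str.len() and str.contains() are shown below; the choice of these two methods is ours, not part of the original notebook.

```python
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])

# Length of each string; the missing entry stays NaN
lengths = s.str.len()
print(lengths)

# True where the string contains a space; na=False treats NaN as "no match"
has_space = s.str.contains(' ', regex=False, na=False)
print(has_space)
```

The na argument is useful when the result will be used as a Boolean mask, since a mask should not contain NaN.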


## Statistical Functions
```
Pandas reduces many common statistical operations in Python to a single line of code.
Some of the most popular and practical statistical operations are covered below.
```
![download.png](attachment:download.png)

## Pandas sum() method

In [28]:
import pandas as pd
# Dataset
data = {
    'Maths': [90, 85, 98, 80, 55, 78],
    'Science': [92, 87, 59, 64, 87, 96],
    'English': [95, 94, 84, 75, 67, 65]
}
# DataFrame
df = pd.DataFrame(data)
# Display the DataFrame 
print("DataFrame = \n",df)
# Display the Sum of Marks in each column 
print("\nSum = \n",df.sum())

DataFrame = 
    Maths  Science  English
0     90       92       95
1     85       87       94
2     98       59       84
3     80       64       75
4     55       87       67
5     78       96       65

Sum = 
 Maths      486
Science    485
English    480
dtype: int64


## Pandas count() method

In [29]:
import pandas as pd
# Dataset
data = {
'Maths': [90, 85, 98, None, 55, 78],
'Science': [92, 87, 59, None, None, 96],
'English': [95, None, 84, 75, 67, None]
}
# DataFrame
df = pd.DataFrame(data)
# Display the DataFrame 
print("DataFrame = \n", df)
# Display the Count of non-empty values in each column 
print("\nCount of non-empty values = \n", df.count())

DataFrame = 
    Maths  Science  English
0   90.0     92.0     95.0
1   85.0     87.0      NaN
2   98.0     59.0     84.0
3    NaN      NaN     75.0
4   55.0      NaN     67.0
5   78.0     96.0      NaN

Count of non-empty values = 
 Maths      5
Science    4
English    4
dtype: int64


## Pandas max() method

In [30]:
import pandas as pd
# Dataset
data = {
    'Maths': [90, 85, 98, 80, 55, 78],
    'Science': [92, 87, 59, 64, 87, 96],
    'English': [95, 94, 84, 75, 67, 65]
}
#DataFrame
df = pd.DataFrame(data)
# Display the DataFrame 
print("DataFrame = \n",df)
# Display the Maximum of Marks in each column 
print("\nMaximum Marks = \n", df.max())

DataFrame = 
    Maths  Science  English
0     90       92       95
1     85       87       94
2     98       59       84
3     80       64       75
4     55       87       67
5     78       96       65

Maximum Marks = 
 Maths      98
Science    96
English    95
dtype: int64


## Pandas min() method

In [31]:
import pandas as pd
# Dataset
data = {
    'Maths': [90, 85, 98, 80, 55, 78],
    'Science': [92, 87, 59, 64, 87, 96],
    'English': [95, 94, 84, 75, 67, 65]
}
# DataFrame
df = pd.DataFrame(data)
# Display the DataFrame
print("DataFrame = \n", df)
# Display the Minimum of Marks in each column 
print("\nMinimum Marks = \n", df.min())

DataFrame = 
    Maths  Science  English
0     90       92       95
1     85       87       94
2     98       59       84
3     80       64       75
4     55       87       67
5     78       96       65

Minimum Marks = 
 Maths      55
Science    59
English    65
dtype: int64


## Pandas median() method


In [32]:
import pandas as pd
# Dataset
data = {
    'Maths': [90, 85, 98, 80, 55, 78],
    'Science': [92, 87, 59, 64, 87, 96],
    'English': [95, 94, 84, 75, 67, 65]
}
# DataFrame
df = pd.DataFrame(data)
# Display the DataFrame
print("DataFrame = \n", df)
# Display the Median of Marks in each column 
print("\nMedian = \n",df.median())

DataFrame = 
    Maths  Science  English
0     90       92       95
1     85       87       94
2     98       59       84
3     80       64       75
4     55       87       67
5     78       96       65

Median = 
 Maths      82.5
Science    87.0
English    79.5
dtype: float64
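
The mean works the same way as the methods above, and describe() gathers several of these statistics in one table. The following sketch uses the same marks data; including mean() and describe() here is our addition.

```python
import pandas as pd

data = {
    'Maths': [90, 85, 98, 80, 55, 78],
    'Science': [92, 87, 59, 64, 87, 96],
    'English': [95, 94, 84, 75, 67, 65]
}
df = pd.DataFrame(data)

# Mean of the marks in each column
means = df.mean()
print("Mean = \n", means)

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary)
```

describe() is a convenient first look at a dataset before computing individual statistics.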


## Pandas isnull() method

In [34]:
import pandas as pd
# Dataset
data = {
'Maths': [90, 85, 98, None, 55, 78],
'Science': [92, 87, 59, None, None, 96],
'English': [95, None, 84, 75, 67, None]
}
# DataFrame
df = pd.DataFrame(data)
# Display the DataFrame 
print("DataFrame = \n", df)

# Display a Boolean mask that is True wherever a value is missing
print("\nMissing values = \n", df.isnull())

DataFrame = 
    Maths  Science  English
0   90.0     92.0     95.0
1   85.0     87.0      NaN
2   98.0     59.0     84.0
3    NaN      NaN     75.0
4   55.0      NaN     67.0
5   78.0     96.0      NaN

Missing values = 
    Maths  Science  English
0  False    False    False
1  False    False     True
2  False    False    False
3   True     True    False
4  False     True    False
5  False    False     True
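
Because True counts as 1, chaining sum() onto the isnull() mask gives the number of missing values per column. A short sketch on the same data:

```python
import pandas as pd

data = {
    'Maths': [90, 85, 98, None, 55, 78],
    'Science': [92, 87, 59, None, None, 96],
    'English': [95, None, 84, 75, 67, None]
}
df = pd.DataFrame(data)

# Number of missing values in each column
missing = df.isnull().sum()
print(missing)
```

This complements count(), which reports the non-missing values: for each column the two results add up to the number of rows.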


## Indexing and Selecting Data
```
In Pandas, indexing means selecting specific rows and columns of data from a DataFrame.
This can mean selecting all the rows and some of the columns, some of the rows and all the columns, or a subset of both rows and columns.
Another term for indexing is subset selection.
Pandas supports three types of multi-axis indexing: the indexing operator [ ], the label-based .loc[ ], and the integer-position-based .iloc[ ].
```

## Indexing a DataFrame using the indexing operator [ ] :
```
The indexing operator [ ] selects columns by label: a single label returns a Series, and a list of labels returns a DataFrame.
Older versions of Pandas also offered the .ix indexer, which could select by both label and integer position.
Although it was flexible, its lack of explicitness caused confusion, because integers can themselves serve as row or column labels.
In most cases .ix was label-based and behaved like the .loc indexer, but it fell back to integer-position selection (like .iloc) when given an integer and the index was not integer-based.
.ix was deprecated and later removed from Pandas, so use .loc and .iloc instead.
```

In [35]:
# importing pandas package
import pandas as pd

In [36]:
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")

In [37]:
# retrieving columns by indexing operator
first = data["Age"]


In [38]:
print(first)

Name
Avery Bradley    25.0
Jae Crowder      25.0
John Holland     27.0
R.J. Hunter      22.0
Jonas Jerebko    29.0
                 ... 
Shelvin Mack     26.0
Raul Neto        24.0
Tibor Pleiss     26.0
Jeff Withey      26.0
NaN               NaN
Name: Age, Length: 458, dtype: float64
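
The indexing operator also accepts a list of column names or a Boolean mask. The sketch below assumes a two-row frame (named players here) built by hand from values that appear in the nba.csv output in this section, so it runs without the file.

```python
import pandas as pd

# Two players copied from the nba.csv output shown in this section
players = pd.DataFrame(
    {"Age": [25.0, 22.0], "Position": ["PG", "SG"], "Salary": [7730337.0, 1148640.0]},
    index=pd.Index(["Avery Bradley", "R.J. Hunter"], name="Name"),
)

ages = players["Age"]                  # single label -> Series
subset = players[["Age", "Salary"]]    # list of labels -> DataFrame
older = players[players["Age"] > 23]   # Boolean mask -> filtered rows

print(ages)
print(subset)
print(older)
```

Note that a list or a single label inside [ ] refers to columns, while a Boolean mask filters rows.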


## Indexing a DataFrame using .loc[ ] :

In [39]:
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]

print(first, "\n\n\n", second)

Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: Avery Bradley, dtype: object 


 Team        Boston Celtics
Number                28.0
Position                SG
Age                   22.0
Height                 6-5
Weight               185.0
College      Georgia State
Salary           1148640.0
Name: R.J. Hunter, dtype: object
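
.loc[ ] can also take a column label, or lists of row and column labels, to narrow the selection. This is a minimal sketch on a hand-built two-row frame (named players, with values copied from the output above) rather than the full CSV.

```python
import pandas as pd

players = pd.DataFrame(
    {"Age": [25.0, 22.0], "Position": ["PG", "SG"], "Salary": [7730337.0, 1148640.0]},
    index=pd.Index(["Avery Bradley", "R.J. Hunter"], name="Name"),
)

# A single cell: row label, then column label
salary = players.loc["R.J. Hunter", "Salary"]
print(salary)

# Several rows and columns: lists of labels on both axes
block = players.loc[["R.J. Hunter", "Avery Bradley"], ["Position", "Age"]]
print(block)
```

The result follows the order of the labels you pass, not the order in the original DataFrame.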


## Indexing a DataFrame using .iloc[ ] :

In [42]:
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving the first row by integer position with iloc
first_row = data.iloc[0]

print(first_row)

Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: Avery Bradley, dtype: object
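
.iloc[ ] also supports slices, negative positions, and a row/column pair of positions. The sketch below uses a small hypothetical DataFrame (labels and values invented for illustration) so the positions are easy to follow.

```python
import pandas as pd

# Hypothetical data; the index labels play no role in iloc selection
df = pd.DataFrame(
    {"x": [10, 20, 30, 40], "y": ["a", "b", "c", "d"]},
    index=["w", "q", "e", "r"],
)

first_two = df.iloc[0:2]    # rows at positions 0 and 1 (end is excluded)
last_row = df.iloc[-1]      # last row, counted from the end
cell = df.iloc[2, 0]        # row position 2, column position 0

print(first_two)
print(last_row)
print(cell)
```

Unlike label slices with .loc[ ], position slices with .iloc[ ] exclude the end position, as ordinary Python slicing does.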
