## Introduction to Python libraries - Pandas


#### PYTHON PANDAS

Pandas is a Python library that provides multidimensional data structures and analysis tools for manipulating numerical and tabular data.

>Note: Rows represent **observations** while columns represent **input features**

**Pandas Data Types**

Recognised pandas data types include:

* **object:**      Text or mixed values
* **int64:**       Integer values
* **float64:**     Floating point numbers
* **category:**    A finite set of text values (categorical data)
* **bool:**        True or false values
* **datetime64:**  Date and time values
* **timedelta64:** Difference between two datetimes
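The types listed above can be seen side by side in a small DataFrame. This is a minimal sketch: the column names and values are made up for illustration, and dtypes are requested explicitly where the default could differ between platforms or pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample dates used to build datetime and timedelta columns.
start = pd.to_datetime(["2020-01-01", "2020-01-02"])
end = pd.to_datetime(["2020-01-03", "2020-01-05"])

sample = pd.DataFrame({
    "name": pd.Series(["Ayo", "Imran"], dtype="object"),  # text
    "age": np.array([10, 15], dtype="int64"),             # integers
    "height": np.array([1.4, 1.6], dtype="float64"),      # floating point
    "group": pd.Categorical(["a", "b"]),                  # categorical
    "enrolled": [True, False],                            # booleans
    "start": start,                                       # datetimes
    "elapsed": end - start,                               # timedeltas
})

# One dtype per column
print(sample.dtypes)
```

The exact resolution shown for the datetime and timedelta columns (for example `[ns]`) may vary with the installed pandas version.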

In [1]:
import numpy as np
import pandas as pd



Ways to create a pandas DataFrame

In [30]:
# Initialize a list of lists, one inner list per row
data = [['Ayo', 10], ['Imran', 15], ['Chucks', 14]]

# Create the pandas DataFrame from the list and add column headers
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Display the DataFrame
df


     Name  Age
0     Ayo   10
1   Imran   15
2  Chucks   14


In [31]:
# Create the pandas DataFrame from a dictionary of lists
#Example 1:
# Initialize a dictionary: each key is a column name, each value is that column's data
data = {'Name': ['Ayo', 'Imran', 'Chucks'], 'Age': [10, 15, 14]}

# Create the pandas DataFrame; the column headers come from the dictionary keys
df = pd.DataFrame(data)

# Display the DataFrame
df


   Age    Name
0   10     Ayo
1   15   Imran
2   14  Chucks


In [32]:
#Example 2:

# Area (square km) and population of some states in Nigeria, with their capitals

dict_data = {"State": ["Abia", "Adamawa", "Lagos", "Osun", "Rivers"],
             "Capital": ["Umuahia", "Yola", "Ikeja", "Osogbo", "Portharcourt"],
             "area": [6320, 36917, 3345, 9251, 11077],
             "population": [2845380, 3178950, 9113605, 3416959, 5198605]}

df = pd.DataFrame(dict_data)

df

        Capital    State   area  population
0       Umuahia     Abia   6320     2845380
1          Yola  Adamawa  36917     3178950
2         Ikeja    Lagos   3345     9113605
3        Osogbo     Osun   9251     3416959
4  Portharcourt   Rivers  11077     5198605


In [34]:
df.dtypes

Capital       object
State         object
area           int64
population     int64
dtype: object
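A column's type can be changed after the DataFrame is created. The following is a minimal sketch using `astype`, on a cut-down copy of the states data; the choice of target types is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Abia", "Adamawa", "Lagos", "Osun", "Rivers"],
    "area": [6320, 36917, 3345, 9251, 11077],
})

# Convert the text column to the category dtype; the values stay the same.
df["State"] = df["State"].astype("category")

# Convert the integer column to floating point.
df["area"] = df["area"].astype("float64")

print(df.dtypes)
```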

**ZIP**

In [35]:
# pandas DataFrame from lists using zip

# List 1
Name = ['Ayo', 'Imran', 'Chucks', 'judith']

# List 2
Age = [25, 30, 26, 22]

# Pair the two lists element by element with zip() to get a list of tuples
list_of_tuples = list(zip(Name, Age))

# Convert the list of tuples into a pandas DataFrame
df = pd.DataFrame(list_of_tuples, columns=['Name', 'Age'])

# Display the DataFrame
df


     Name  Age
0     Ayo   25
1   Imran   30
2  Chucks   26
3  judith   22


**SERIES**

A Series is a one-dimensional labelled array that represents a single column of data in memory. It can stand alone or be one column of a pandas DataFrame.
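Both cases can be illustrated with a short sketch: a Series built on its own, and a Series obtained by selecting one column of a DataFrame. The variable names here are illustrative.

```python
import pandas as pd

# A standalone Series: one column of values with its own index.
ages = pd.Series([10, 15, 14], name="Age")

df = pd.DataFrame({"Name": ["Ayo", "Imran", "Chucks"], "Age": [10, 15, 14]})

# Selecting one column of a DataFrame with [] returns a Series.
age_column = df["Age"]

print(type(age_column).__name__)
print(age_column.equals(ages))
```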

In [36]:
# pandas DataFrame from a dictionary of Series

import pandas as pd

# Initialise the data as a dictionary of Series, one Series per column
series_data = {"State": pd.Series(["Abia", "Adamawa", "Lagos", "Osun", "Rivers"]),
               "Capital": pd.Series(["Umuahia", "Yola", "Ikeja", "Osogbo", "Portharcourt"]),
               "area": pd.Series([6320, 36917, 3345, 9251, 11077]),
               "population": pd.Series([2845380, 3178950, 9113605, 3416959, 5198605])}

# Create the DataFrame
df = pd.DataFrame(series_data)

# Display the DataFrame
df


        Capital    State   area  population
0       Umuahia     Abia   6320     2845380
1          Yola  Adamawa  36917     3178950
2         Ikeja    Lagos   3345     9113605
3        Osogbo     Osun   9251     3416959
4  Portharcourt   Rivers  11077     5198605


### Creating a Series from a list, dictionary, or NumPy array

In [2]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20, 'c':30}

In [3]:
my_list

[10, 20, 30]

In [4]:
pd.Series(data = my_list)

0    10
1    20
2    30
dtype: int64

In [5]:
pd.Series(data=my_list, index = labels)

a    10
b    20
c    30
dtype: int64

In [6]:
pd.Series(my_list, labels)

a    10
b    20
c    30
dtype: int64

In [7]:
pd.Series(d)

a    10
b    20
c    30
dtype: int64

In [8]:
pd.Series(labels)

0    a
1    b
2    c
dtype: object

In [9]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [10]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32
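The array-based results above show `int32`, while the list-based ones show `int64`. A plausible explanation (an assumption, since the notebook does not say) is that NumPy's default integer was 32-bit on the machine where it ran, and a Series built from an array keeps the array's dtype. A minimal sketch of making the dtype explicit:

```python
import numpy as np
import pandas as pd

# Force a 32-bit array to mimic the output above (an assumption about its origin).
arr32 = np.array([10, 20, 30], dtype="int32")

inherited = pd.Series(arr32)                # keeps the array's dtype
explicit = pd.Series(arr32, dtype="int64")  # converts on construction

print(inherited.dtype, explicit.dtype)
```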

**External sources**

**CSV:** Another way to create a DataFrame is to import a CSV file using `pd.read_csv`.



In [37]:
csv_df = pd.read_csv('Data/2006.csv')

csv_df

            STATES  AREA (km2)  Population
0       Abia State        6320     2845380
1    Adamawa State       36917     3178950
2  Akwa Ibom State        7081     3178950
3    Anambra State        4844     4177828
4     Bauchi State       45837     4653066
5    Bayelsa State       10773     1704515
6      Benue State       34059     4253641
7      Borno State       70898     4171104
8      Cross River       20156     2892988
9      Delta State       17698     4112445
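`pd.read_csv` also accepts a file-like object, which makes it possible to try the call without the `Data/2006.csv` file on disk. The sketch below uses an in-memory string as a stand-in for the file; the three rows are copied from the output above, and the names `csv_text` and `small_df` are illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file; the rows are copied from the output above.
csv_text = """STATES,AREA (km2),Population
Abia State,6320,2845380
Adamawa State,36917,3178950
Akwa Ibom State,7081,3178950
"""

# The first line is used as the header; column types are inferred.
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)
print(small_df.columns.tolist())
```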


**EXCEL- XLSX**

In [38]:
Excel_df = pd.read_excel('Data/2006.xlsx')

Excel_df

            STATES  AREA (km2)  Population
0       Abia State        6320     2845380
1    Adamawa State       36917     3178950
2  Akwa Ibom State        7081     3178950
3    Anambra State        4844     4177828
4     Bauchi State       45837     4653066
5    Bayelsa State       10773     1704515
6      Benue State       34059     4253641
7      Borno State       70898     4171104
8      Cross River       20156     2892988
9      Delta State       17698     4112445


In [39]:
# With no argument, head() returns the first 5 rows
print(csv_df.head(), '\n')

# This returns the first 5 rows of the Population column
print(csv_df['Population'].head())

            STATES  AREA (km2)   Population
0       Abia State         6320     2845380
1    Adamawa State        36917     3178950
2  Akwa Ibom State         7081     3178950
3    Anambra State         4844     4177828
4     Bauchi State        45837     4653066 

0    2845380
1    3178950
2    3178950
3    4177828
4    4653066
Name: Population, dtype: int64


In [40]:
# With no argument, tail() returns the last 5 rows
print(csv_df.tail(), '\n')

# This returns the last 5 rows of the Population column
print(csv_df['Population'].tail())

           STATES  AREA (km2)   Population
32   Rivers State        11077     5198605
33   Sokoto State        25973     3702676
34   Taraba State        54473     2294800
35     Yobe State        45502     2321339
36  Zamfara State        39762     3278873 

32    5198605
33    3702676
34    2294800
35    2321339
36    3278873
Name: Population, dtype: int64
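Both `head()` and `tail()` accept the number of rows to return. A minimal sketch on a small DataFrame built from values used earlier in this notebook:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Abia", "Adamawa", "Lagos", "Osun", "Rivers"],
    "population": [2845380, 3178950, 9113605, 3416959, 5198605],
})

first_two = df.head(2)                 # first 2 rows
last_three = df["population"].tail(3)  # last 3 values of one column

print(first_two)
print(last_three)
```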


In [41]:
#For summary of descriptive statistics of the dataframe
csv_df.describe()

         AREA (km2)  Population
count          37.0        37.0
mean   24990.864865   3775879.0
std    18243.870444   1726418.0
min          3345.0   1405201.0
25%          9251.0   2845380.0
50%         20156.0   3314043.0
75%         36800.0   4177828.0
max         76363.0   9401288.0


In [42]:
#To include summary of descriptive statistics of non numeric columns of the dataframe 
csv_df.describe(include='all')

             STATES    AREA (km2)  Population
count            37          37.0        37.0
unique           37           NaN         NaN
top     Ekiti State           NaN         NaN
freq              1           NaN         NaN
mean            NaN  24990.864865   3775879.0
std             NaN  18243.870444   1726418.0
min             NaN        3345.0   1405201.0
25%             NaN        9251.0   2845380.0
50%             NaN       20156.0   3314043.0
75%             NaN       36800.0   4177828.0


In [43]:
csv_df['Population'].mean()

3775879.4594594594

Other descriptive statistics functions are:

* **count():**   Number of non-null observations
* **sum():**     Sum of values
* **mean():**    Mean of values
* **median():**  Median of values
* **mode():**    Mode of values
* **std():**     Standard deviation of the values
* **min():**     Minimum value
* **max():**     Maximum value
* **abs():**     Absolute value
* **prod():**    Product of values
* **cumsum():**  Cumulative sum
* **cumprod():** Cumulative product

>Note: Functions such as abs() and cumprod() raise an exception when the DataFrame contains character or string data, because these operations are not defined for text.
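One way to avoid that exception is to restrict the DataFrame to its numeric columns before calling such functions. This is a minimal sketch using `select_dtypes`; the `change` column and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Abia", "Lagos", "Osun"],
    "change": [-2, 3, -4],
})

# Keep only numeric columns so that abs() and cumprod() are well defined.
numeric = df.select_dtypes(include="number")

print(numeric.abs())
print(numeric.cumprod())

# Statistics on a single numeric column work directly.
print(df["change"].sum(), df["change"].median())
```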

In [44]:
#To show the features in the dataset
csv_df.columns

Index(['STATES', 'AREA (km2) ', 'Population'], dtype='object')
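The output above shows that one column name, `'AREA (km2) '`, ends with a space, so `csv_df['AREA (km2)']` would raise a `KeyError`. A minimal sketch of one way to clean the names, using a one-row stand-in for the real data:

```python
import pandas as pd

# Column names mirror the output above, including the trailing space.
df = pd.DataFrame({
    "STATES": ["Abia State"],
    "AREA (km2) ": [6320],
    "Population": [2845380],
})

# Remove leading and trailing whitespace from every column name.
df.columns = df.columns.str.strip()

print(df.columns.tolist())
print(df["AREA (km2)"].iloc[0])
```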

In [45]:
#To show even more information about the dataset

csv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 3 columns):
STATES         37 non-null object
AREA (km2)     37 non-null int64
Population     37 non-null int64
dtypes: int64(2), object(1)
memory usage: 968.0+ bytes


### HTML

In [12]:
# data = pd.read_html("url")[0]  # read_html returns a list of DataFrames, one per HTML table found; [0] takes the first

#### Data Output


In [13]:
# data.to_csv("My Output")  # Write the DataFrame to a CSV file named "My Output"
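To see what `to_csv` produces without writing a file, it can be called with no path, in which case it returns the CSV text as a string. The sketch below uses a small illustrative DataFrame and passes `index=False` so that the row index is not written as an extra column.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Ayo", "Imran"], "Age": [10, 15]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the text back gives an equal DataFrame.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.equals(df))
```

Without `index=False`, the index is written as an unnamed first column, which then shows up as `Unnamed: 0` when the file is read back.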

For further reading:
    
https://pbpython.com/pandas_dtypes.html

https://en.wikipedia.org/wiki/Matrix_(mathematics)

https://www.geeksforgeeks.org/best-python-libraries-for-machine-learning/