### Pandas - So much more than a cute animal

Pandas is a library for `data manipulation` built on top of NumPy. Unlike plain NumPy arrays, its objects can be `indexed` by descriptive labels as well as by integers. `Series`, `DataFrame` and `Index` are the basic data structures in this library. A `Series` is a one-dimensional labelled array whose elements share a single dtype (`homogeneous elements`); values of mixed types are stored under the generic `object` dtype. It is similar to a NumPy array, but it can be indexed with descriptive labels as well as with integer positions.

#### convention for importing pandas

In [2]:
# import necessary library
import pandas as pd
import numpy as np

In [3]:
days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])
print(days)

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
dtype: object


In [4]:
# create a NumPy array to build a Series from

import numpy as np
days_list = np.array(['Dog', 'Elephant', 'Tiger', 'Donkey', 'Animals'])
print(days_list)

['Dog' 'Elephant' 'Tiger' 'Donkey' 'Animals']


In [5]:
# Convert the above array to a pandas Series

numpy_series = pd.Series(days_list)
print(numpy_series)

0         Dog
1    Elephant
2       Tiger
3      Donkey
4     Animals
dtype: object


In [6]:
print(type(numpy_series))

<class 'pandas.core.series.Series'>


In [7]:
# create a Series with custom index labels

df_days = pd.Series(['Sillians', 'Achugo', 'Uzor', 'Olaniyi'],  index=['a', 'b', 'c', 'd'])
print(df_days)

a    Sillians
b      Achugo
c        Uzor
d     Olaniyi
dtype: object


In [8]:
# create series from a dictionary

days_dict = pd.Series({'a': 'Sillians', 'b': 'Achugo', 'c': 'Uzor', 'd': 'Olaniyi'})
print(days_dict)

a    Sillians
b      Achugo
c        Uzor
d     Olaniyi
dtype: object


In [9]:
print(type(days_dict))

<class 'pandas.core.series.Series'>


The elements of a Series can be accessed by label or by position. The slice below is positional: it returns everything from position 2 onward.

In [10]:
days_dict[2:]

c       Uzor
d    Olaniyi
dtype: object
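
To make the difference between labels and positions explicit, the sketch below rebuilds the same data under a new name (`names`, chosen only for this example) and uses the `.loc` and `.iloc` accessors. It is a minimal illustration, not the only way to index a Series.

```python
import pandas as pd

# Same data as days_dict above, rebuilt so the snippet runs on its own
names = pd.Series({'a': 'Sillians', 'b': 'Achugo', 'c': 'Uzor', 'd': 'Olaniyi'})

by_label = names.loc['c']          # label-based lookup
by_position = names.iloc[2]        # position-based lookup of the same element
label_slice = names.loc['b':'c']   # label slices include both endpoints
position_slice = names.iloc[1:3]   # positional slices exclude the stop position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Both slices return the rows labelled `b` and `c`, which shows that a label slice keeps its end label while a positional slice does not.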

In [11]:
data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Dehli", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98]}

frame = pd.DataFrame(data)
print(frame)

        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Dehli   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98


In [12]:
frame.iloc[3:4, 3:]

   population
3      1357.0


In [34]:
frame.index = ["BR", "RU", "IN", "CH", "SA"]
print(frame)

         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Dehli   3.286     1252.00
CH         China    Beijing   9.597     1357.00
SA  South Africa   Pretoria   1.221       52.98


In [35]:
# select the row at integer position 3
frame.iloc[3]

country         China
capital       Beijing
area            9.597
population       1357
Name: CH, dtype: object

In [41]:
# select the capital column
frame['capital']

BR     Brasilia
RU       Moscow
IN    New Dehli
CH      Beijing
SA     Pretoria
Name: capital, dtype: object

A DataFrame can be described as a two-dimensional table made up of several Series that share the same index. It holds data in rows and columns, just like a spreadsheet. New DataFrames can be created from Series, dictionaries, lists, NumPy arrays and other DataFrames.


In [14]:
# create an empty DataFrame
pd.DataFrame()

In [15]:
# create a dataframe from a dictionary
df_dict = {'Country': ['Ghana', 'Kenya', 'Nigeria', 'Togo'],
           'Capital': ['Accra', 'Nairobi', 'Abuja', 'Lome'],
           'Population': [10000, 8500, 35000, 12000],
           'Age': [60, 70, 80, 75]
}


df = pd.DataFrame(df_dict, index=[2, 4, 6, 8])
# df = pd.DataFrame(df_dict, columns=[''])
print(df)

   Country  Capital  Population  Age
2    Ghana    Accra       10000   60
4    Kenya  Nairobi        8500   70
6  Nigeria    Abuja       35000   80
8     Togo     Lome       12000   75


In [16]:
df_list = [['Ghana', 'Accra', 10000, 60], 
           ['Kenya', 'Nairobi', 8500, 70], 
           ['Nigeria',   'Abuja', 35000, 80], 
           ['Togo', 'Lome', 12000, 75]]

df1 = pd.DataFrame(df_list, columns=['Country', 'Capital','Population', 'Age'], 
                   index=[3, 5, 7, 9])
print(df1)

   Country  Capital  Population  Age
3    Ghana    Accra       10000   60
5    Kenya  Nairobi        8500   70
7  Nigeria    Abuja       35000   80
9     Togo     Lome       12000   75
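
Besides dictionaries and lists, a DataFrame can also be built from Series objects or from a NumPy array. The sketch below shows both; the variable names (`capital`, `age`, `values`) and the choice of country names as the index are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Two Series that share the same index become two columns of one DataFrame
capital = pd.Series(['Accra', 'Nairobi'], index=['Ghana', 'Kenya'])
age = pd.Series([60, 70], index=['Ghana', 'Kenya'])
from_series = pd.DataFrame({'Capital': capital, 'Age': age})

# A 2-D NumPy array carries no labels, so column and index names are passed explicitly
values = np.array([[10000, 60], [8500, 70]])
from_array = pd.DataFrame(values, columns=['Population', 'Age'],
                          index=['Ghana', 'Kenya'])

print(from_series)
print(from_array)
```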


`at`, `iat`, `iloc` and `loc` are accessors used to retrieve data from DataFrames. `iloc` selects rows and columns by integer position, while `loc` selects rows and columns by label. `at` and `iat` retrieve a single value: `at` takes the row and column labels, and `iat` takes the integer positions.

In [17]:
df1.loc[3]

Country       Ghana
Capital       Accra
Population    10000
Age              60
Name: 3, dtype: object

In [18]:
df1.iloc[3]

Country        Togo
Capital        Lome
Population    12000
Age              75
Name: 9, dtype: object

In [19]:
df1['Country']

3      Ghana
5      Kenya
7    Nigeria
9       Togo
Name: Country, dtype: object

In [20]:
df1.at[7, 'Age']

80

In [21]:
df1.iat[3, 3]

75
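
The accessors can also select several rows and columns at once. The sketch below rebuilds `df1` so it runs on its own, then selects the same block of cells with `loc` and with `iloc`, and finally filters rows with a boolean mask. The result names (`by_label`, `by_position`, `older`) are chosen only for this example.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['Ghana', 'Accra', 10000, 60],
     ['Kenya', 'Nairobi', 8500, 70],
     ['Nigeria', 'Abuja', 35000, 80],
     ['Togo', 'Lome', 12000, 75]],
    columns=['Country', 'Capital', 'Population', 'Age'],
    index=[3, 5, 7, 9],
)

# loc: row labels 5 through 7 (both included) and two named columns
by_label = df1.loc[5:7, ['Country', 'Age']]

# iloc: the same cells addressed by integer position (stop is excluded)
by_position = df1.iloc[1:3, [0, 3]]

# loc also accepts a boolean mask to filter rows
older = df1.loc[df1['Age'] > 70, 'Country']

print(by_label)
print(by_position)
print(older)
```

Because the index here is made of integers that differ from the positions, `df1.loc[5:7]` and `df1.iloc[5:7]` are not interchangeable; the first looks up labels, the second counts rows.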

Finally, an `Index` in pandas is an immutable array of labels that behaves much like an ordered set. Labels are usually unique, although duplicates are allowed. Indexes are used to retrieve data from a DataFrame and to align data across multiple DataFrames.
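
A small sketch of these two properties, using made-up Series `s1` and `s2`: assigning to an element of an Index fails, and arithmetic between two Series matches values by label rather than by position.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# An Index cannot be modified in place; item assignment raises TypeError
try:
    idx[0] = 'z'
except TypeError as err:
    print('Index is immutable:', err)

# Arithmetic aligns on labels, not on position
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
total = s1 + s2   # labels present in only one Series produce NaN

print(total)
```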

Important pandas functionalities include indexing, reindexing, selection, grouping, dropping entries, ranking, sorting, handling duplicates and hierarchical indexing.
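
The sketch below touches several of these operations on a small table. The `sales` data is hypothetical and invented for this example, and only a subset of the operations is shown.

```python
import pandas as pd

# Hypothetical data, invented for this example
sales = pd.DataFrame({
    'region': ['West', 'East', 'West', 'East', 'East'],
    'units': [5, 3, 8, 3, 6],
})

sorted_sales = sales.sort_values('units', ascending=False)  # sorting
ranks = sales['units'].rank(method='min')                   # ranking; ties share the lowest rank
unique_rows = sales.drop_duplicates()                       # removing duplicate rows
without_first = sales.drop(index=0)                         # dropping a row by its label
per_region = sales.groupby('region')['units'].sum()         # grouping
reindexed = per_region.reindex(['East', 'North', 'West'])   # reindexing; 'North' has no data

print(sorted_sales)
print(ranks)
print(per_region)
print(reindexed)
```

None of these calls changes `sales` itself; each returns a new object, which is why the results are assigned to new names.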

Summary and descriptive statistics: measures of central tendency, measures of dispersion, skewness and kurtosis, correlation and multicollinearity

Similar to NumPy, pandas provides functions for descriptive statistics such as measures of `central tendency`, `dispersion`, `skewness`, `kurtosis` and `correlation`; a correlation matrix is also a common first check for `multicollinearity`. Some of these functions are `mode()`, `median()`, `mean()`, `sum()`, `std()`, `var()`, `skew()`, `kurt()` and `min()`. The `describe()` function summarises the numeric columns of a DataFrame, displaying the `count`, `mean`, `standard deviation`, `minimum`, `quartiles` (25%, 50% and 75%) and `maximum`.

In [22]:
df1['Population'].mean()

16375.0

In [23]:
df1['Age'].sum()

285

In [24]:
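# numeric_only=True restricts the calculation to the numeric columns;
# recent pandas versions raise a TypeError on the text columns without it
df1.mean(numeric_only=True)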

Population    16375.00
Age              71.25
dtype: float64

In [25]:
df1.describe().T

            count      mean           std     min     25%      50%       75%      max
Population    4.0  16375.00  12499.166639  8500.0  9625.0  11000.0  17750.00  35000.0
Age           4.0     71.25      8.539126    60.0    67.5     72.5     76.25     80.0


In [26]:
df1['Age'].mode()

0    60
1    70
2    75
3    80
dtype: int64

In [27]:
# numeric_only=True restricts the calculation to the numeric columns
df1.std(numeric_only=True)

Population    12499.166639
Age               8.539126
dtype: float64

In [28]:
# numeric_only=True restricts the calculation to the numeric columns
df1.skew(numeric_only=True)

Population    1.921968
Age          -0.752837
dtype: float64

In [29]:
# numeric_only=True restricts the calculation to the numeric columns
df1.kurtosis(numeric_only=True)

Population    3.734702
Age           0.342857
dtype: float64

The missing data enigma: `Importance`, `types` and `handling missing data`.

Data used for `analysis` in real-life scenarios is often incomplete as a result of omissions, `faulty devices` and many other factors. `Pandas` represents missing values as `NA` or `NaN`; they can be detected with `isnull()` and `notnull()`, filled with `fillna()` or `replace()`, and removed with `dropna()`.

In [30]:
df_dict2 = {'Name': ['James', 'Yemen', 'Caro', np.nan],
           'Profession': ['Researcher', 'Artist', 'Doctor', 'Writer'],
           'Experience': [12, np.nan, 10, 8],
           'Height': [np.nan, 175, 180, 150]}

new_df = pd.DataFrame(df_dict2)
print(new_df)

    Name  Profession  Experience  Height
0  James  Researcher        12.0     NaN
1  Yemen      Artist         NaN   175.0
2   Caro      Doctor        10.0   180.0
3    NaN      Writer         8.0   150.0


In [31]:
new_df.isnull()

    Name  Profession  Experience  Height
0  False       False       False    True
1  False       False        True   False
2  False       False       False   False
3   True       False       False   False


In [32]:
new_df.isnull().sum()

Name          1
Profession    0
Experience    1
Height        1
dtype: int64

In [33]:
# drop every row that contains at least one missing value (modifies new_df in place)
new_df.dropna(axis=0, inplace=True)
print(new_df)

   Name Profession  Experience  Height
2  Caro     Doctor        10.0   180.0
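
Dropping rows left only one record here, so filling is often the gentler option. The sketch below rebuilds the same data under a new name (`people`, chosen for this example) and fills each column with a different value. Using the mean for `Experience` and the median for `Height` is an illustrative assumption, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['James', 'Yemen', 'Caro', np.nan],
    'Profession': ['Researcher', 'Artist', 'Doctor', 'Writer'],
    'Experience': [12, np.nan, 10, 8],
    'Height': [np.nan, 175, 180, 150],
})

# fillna accepts a dict that maps each column to its own fill value
filled = people.fillna({
    'Name': 'Unknown',
    'Experience': people['Experience'].mean(),   # mean of the known values
    'Height': people['Height'].median(),         # median of the known values
})

print(filled)
print(filled.isnull().sum())
```

Without `inplace=True`, `fillna()` returns a new DataFrame and leaves `people` unchanged, so the original missing values remain available for comparison.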
