# Pandas - So much more than a cute animal


In this lesson, you will learn core pandas operations, such as indexing, while exploring use cases in electricity utility data.

<a href="https://pandas.pydata.org/" target="_blank">Pandas</a> is a data manipulation library built on NumPy that supports indexing by labels as well as by integers. Series, DataFrame and Index are its basic data structures. A Series is a one-dimensional labelled array whose elements share a single dtype (a Series holding strings or mixed values uses the generic object dtype). It is similar to a NumPy array, but it can be indexed with descriptive labels as well as with integer positions.

## Convention for importing pandas

In [29]:
import pandas as pd  # Manipulating dataframes
import numpy as np  # Manipulating arrays and matrices

In [2]:
days = pd.Series(["Monday", "Tuesday", "Wednesday"])
print(days) 

0       Monday
1      Tuesday
2    Wednesday
dtype: object


#### Creating series with a numpy array

In [3]:
days_list = np.array(["Monday", "Tuesday", "Wednesday"])
numpy_days = pd.Series(days_list)
print(numpy_days)

0       Monday
1      Tuesday
2    Wednesday
dtype: object


#### Using strings as index

In [4]:
days = pd.Series(["Monday", "Tuesday", "Wednesday"], 
                 index= ["a", "b", "c"])
days


a       Monday
b      Tuesday
c    Wednesday
dtype: object

#### Create series from a dictionary

In [5]:
days1 = pd.Series({"a":"Monday", "b":"Tuesday", "c":"Wednesday"})
days1

a       Monday
b      Tuesday
c    Wednesday
dtype: object

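#### A Series can be accessed by position or by its specified index labels, as shown below

In [6]:
days.iloc[0]  # positional access; plain days[0] on a string-labelled Series is deprecated in recent pandas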

'Monday'

In [7]:
days["c"]

'Wednesday'
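
The two access styles above can be made explicit with the iloc and loc accessors. The snippet below is a minimal sketch that rebuilds the same days Series; the variable names first_by_position, last_by_label and subset are illustrative only.

```python
import pandas as pd

# Same Series as above: string labels "a", "b", "c"
days = pd.Series(["Monday", "Tuesday", "Wednesday"], index=["a", "b", "c"])

first_by_position = days.iloc[0]   # integer position 0
last_by_label = days.loc["c"]      # index label "c"
subset = days.loc[["a", "c"]]      # several labels at once returns a Series

print(first_by_position)
print(last_by_label)
print(subset.tolist())
```

Being explicit about iloc (position) versus loc (label) avoids ambiguity when the index itself contains integers.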

A DataFrame can be described as a two-dimensional table made up of several Series that share the same index. It holds data in rows and columns, just like a spreadsheet. Series, dictionaries, lists, NumPy arrays and other dataframes can all be used to create new ones.

In [8]:
print(pd.DataFrame())

Empty DataFrame
Columns: []
Index: []
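
To illustrate the idea that a DataFrame is a collection of Series sharing one index, here is a small sketch. The two Series and the name countries are made up for this example and use country names as labels.

```python
import pandas as pd

# Two Series that share the same index labels
capital = pd.Series(["Accra", "Nairobi"], index=["Ghana", "Kenya"])
age = pd.Series([60, 70], index=["Ghana", "Kenya"])

# Each Series becomes one column; the shared index becomes the row labels
countries = pd.DataFrame({"Capital": capital, "Age": age})
print(countries)

# Selecting a column gives back a Series with the same index
print(countries["Age"])
```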


## Create a dataframe from a dictionary

In [9]:
df_dict = {"Country": ["Ghana", "Kenya", "Nigeria", "Togo"],
          "Capital":["Accra", "Nairobi", "Abuja", "Lome"],
          "Population": [10000, 8500, 35000, 12000],
          "Age": [60, 70, 80, 75]}

df_dict

{'Country': ['Ghana', 'Kenya', 'Nigeria', 'Togo'],
 'Capital': ['Accra', 'Nairobi', 'Abuja', 'Lome'],
 'Population': [10000, 8500, 35000, 12000],
 'Age': [60, 70, 80, 75]}

In [10]:
df = pd.DataFrame(df_dict, index= [2, 4, 6, 8])
df


   Country  Capital  Population  Age
2    Ghana    Accra       10000   60
4    Kenya  Nairobi        8500   70
6  Nigeria    Abuja       35000   80
8     Togo     Lome       12000   75


In [11]:
df_list = [["Ghana", "Accra", 10000, 60],
          ["Kenya", "Nairobi", 8500, 70],
          ["Nigeria", "Abuja", 35000, 80],
          ["Togo", "Lome", 12000, 75]]

df1 = pd.DataFrame(df_list, columns= ["Country", "Capital", 
                                      "Population", "Age"], 
                  index = [2, 4, 6, 8])

In [12]:
print(df1)

   Country  Capital  Population  Age
2    Ghana    Accra       10000   60
4    Kenya  Nairobi        8500   70
6  Nigeria    Abuja       35000   80
8     Togo     Lome       12000   75


at, iat, iloc and loc are accessors used to retrieve data from dataframes. iloc selects rows and columns by integer position, while loc selects rows and columns by label. at and iat retrieve a single value: at takes a row label and a column label, and iat takes integer positions.

### Select the row at integer position 3

In [13]:
df.iloc[3]

Country        Togo
Capital        Lome
Population    12000
Age              75
Name: 8, dtype: object

### Select the row with index label 6

In [14]:
df.loc[6]

Country       Nigeria
Capital         Abuja
Population      35000
Age                80
Name: 6, dtype: object

### Select the Capital column

In [15]:
df["Capital"]

2      Accra
4    Nairobi
6      Abuja
8       Lome
Name: Capital, dtype: object

#### at and iat retrieve single values: at takes row and column labels, iat takes integer positions

In [16]:
df.at[6, "Country"]

'Nigeria'

In [17]:
df.at[8, "Country"]

'Togo'

In [18]:
df.iat[2, 0]

'Nigeria'

In [19]:
df.iat[2, 2]

35000
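
Beyond single rows and values, loc and iloc also accept slices, lists and boolean conditions. The sketch below rebuilds the same df and shows one behaviour that often surprises newcomers: label slices with loc include the end label, while position slices with iloc exclude the end position. The names by_label, by_position and large are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"Country": ["Ghana", "Kenya", "Nigeria", "Togo"],
     "Capital": ["Accra", "Nairobi", "Abuja", "Lome"],
     "Population": [10000, 8500, 35000, 12000],
     "Age": [60, 70, 80, 75]},
    index=[2, 4, 6, 8],
)

# Label-based slice: rows labelled 2, 4 and 6 (the end label is included)
by_label = df.loc[2:6, ["Country", "Capital"]]

# Position-based slice: rows 0 and 1, columns 0 and 1 (the end is excluded)
by_position = df.iloc[0:2, 0:2]

# Boolean condition on the rows combined with a column label
large = df.loc[df["Population"] > 10000, "Country"]

print(by_label)
print(by_position)
print(large.tolist())
```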

Finally, an Index in pandas is an immutable array that behaves much like an ordered set, although unlike a true set it may contain duplicate labels. Indexes are used to retrieve data from a dataframe and to align data across multiple dataframes.

Important pandas functionality includes indexing, reindexing, selection, grouping, dropping entries, ranking, sorting, handling duplicates and hierarchical indexing.

Similar to NumPy, pandas provides functions for summary and descriptive statistics: measures of central tendency and dispersion, skewness and kurtosis, and correlation (which can also be used to check for multicollinearity). Some of these functions are mode(), median(), mean(), sum(), std(), var(), skew(), kurt(), corr() and min(). The describe() function summarises the numeric columns in a dataframe, displaying the count, mean, standard deviation, minimum, quartiles (25%, 50% and 75%) and maximum.
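
The properties of an Index described above can be checked with a short sketch. The index values below are made up for illustration; the calls used (intersection, union, is_unique) are standard Index methods and attributes.

```python
import pandas as pd

idx_a = pd.Index([2, 4, 6, 8])
idx_b = pd.Index([6, 8, 10])

# An Index cannot be modified in place
immutable = False
try:
    idx_a[0] = 99
except TypeError:
    immutable = True

# Set-like operations between two indexes
common = idx_a.intersection(idx_b)
combined = idx_a.union(idx_b)

# Unlike a true set, an Index may hold duplicate labels
dup = pd.Index(["a", "b", "a"])

print(immutable)
print(list(common), list(combined))
print(dup.is_unique)
```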

#### PS: Check out the <a href="https://pandas.pydata.org/" target="_blank">documentation</a> for more

In [22]:
df["Population"].sum()

65500

In [23]:
df.mean(numeric_only=True)  # skip the text columns; recent pandas versions raise an error without this

Population    16375.00
Age              71.25
dtype: float64

In [24]:
df.describe()

         Population        Age
count      4.000000   4.000000
mean   16375.000000  71.250000
std    12499.166639   8.539126
min     8500.000000  60.000000
25%     9625.000000  67.500000
50%    11000.000000  72.500000
75%    17750.000000  76.250000
max    35000.000000  80.000000
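
A few of the other statistics functions mentioned earlier can be tried on the same numeric columns. This is a minimal sketch; the variable names are illustrative, and the interquartile range is computed by hand from two quantiles because describe() reports only the quartiles themselves.

```python
import pandas as pd

df = pd.DataFrame({"Population": [10000, 8500, 35000, 12000],
                   "Age": [60, 70, 80, 75]})

median_pop = df["Population"].median()    # central tendency
age_std = df["Age"].std()                 # sample standard deviation (ddof=1)
age_var = df["Age"].var()                 # sample variance
pop_skew = df["Population"].skew()        # asymmetry of the distribution
corr = df["Population"].corr(df["Age"])   # Pearson correlation by default

# Interquartile range from the 75% and 25% quantiles
iqr = df["Population"].quantile(0.75) - df["Population"].quantile(0.25)

print(median_pop, iqr)
print(round(age_std, 6), round(age_var, 6))
print(pop_skew, corr)
```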


### The missing data enigma: Importance, types and handling missing data.

Data used for analysis in real-life scenarios is often incomplete as a result of omissions, faulty devices and many other factors. Pandas represents missing values as NA or NaN. They can be detected with isnull() and notnull(), filled with fillna() or replace(), and removed with dropna().

In [25]:
df_dict2 = {"Name": ["James", "Yemen", "Caro", np.nan],
           "Profession": ["Researcher", "Artist", "Doctor", "Writer"],
           "Experience":[12, np.nan, 10, 8],
           "Height": [np.nan, 175, 180, 150]}

In [26]:
new_df = pd.DataFrame(df_dict2)
new_df

    Name  Profession  Experience  Height
0  James  Researcher        12.0     NaN
1  Yemen      Artist         NaN   175.0
2   Caro      Doctor        10.0   180.0
3    NaN      Writer         8.0   150.0


#### Flag cells with missing values as True

In [27]:
new_df.isnull()

    Name  Profession  Experience  Height
0  False       False       False    True
1  False       False        True   False
2  False       False       False   False
3   True       False       False   False


#### Remove rows with missing values

In [28]:
new_df.dropna()

   Name Profession  Experience  Height
2  Caro     Doctor        10.0   180.0
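
Dropping every incomplete row leaves only one record here, so filling is often a gentler option. The sketch below rebuilds new_df and shows fillna() with a per-column dictionary. The fill choices (the text "Unknown", the column mean and the column median) are assumptions made for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

new_df = pd.DataFrame({
    "Name": ["James", "Yemen", "Caro", np.nan],
    "Profession": ["Researcher", "Artist", "Doctor", "Writer"],
    "Experience": [12, np.nan, 10, 8],
    "Height": [np.nan, 175, 180, 150],
})

# Count the missing values in each column
missing_per_column = new_df.isnull().sum()

# Fill each column with its own replacement value (illustrative choices)
filled = new_df.fillna({
    "Name": "Unknown",
    "Experience": new_df["Experience"].mean(),
    "Height": new_df["Height"].median(),
})

# Drop rows only when a specific column is missing
has_height = new_df.dropna(subset=["Height"])

print(missing_per_column)
print(filled)
print(has_height)
```

Note that fillna() and dropna() return new dataframes by default, so new_df itself is left unchanged.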
