# Intro to Pandas

Pandas is a Python package for data analysis. It provides two core
data structures: DataFrames and Series.

- [DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) store tabular data consisting of rows and columns.
- [Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) are one-dimensional labeled arrays, similar to a Python list in which each value also carries an index label. Each column of a DataFrame is a Series.

In this notebook, we will explore the data structures that Pandas
provides, and learn how to interact with them.
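
As a quick illustration of the two structures, here is a minimal sketch that builds each one by hand. The values and the names `loan_amounts` and `loans` are invented for this example and are not taken from the dataset used later.

```python
import pandas as pd

# A Series is a one-dimensional array of values, each with an index label.
loan_amounts = pd.Series([300, 625, 1600], index=["a", "b", "c"], name="loan_amount")

# A DataFrame is a table; each of its columns is a Series sharing one index.
loans = pd.DataFrame(
    {
        "activity": ["Retail", "Farming", "Food"],
        "loan_amount": [300, 625, 1600],
    }
)

print(loan_amounts["b"])        # look up a value by its label
print(type(loans["activity"]))  # selecting one column gives a Series
print(loans.shape)              # (number of rows, number of columns)
```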

### 1. Importing Pandas

To import an external Python library such as Pandas, use Python's
`import` statement. To save yourself some typing later on, you can
give the library you import an alias. Here, we import Pandas under
the alias `pd`, along with the standard-library `os` module.

In [1]:
import os

import pandas as pd

### 2. Creating a DataFrame and Basic Exploration
We will load a CSV file as a DataFrame using Pandas' `read_csv`
function. This will allow us to use Pandas' DataFrame methods to
explore the data in the CSV.

In [2]:
path = '../data/'
fileName = 'loans_full_africa'

# Git command used as a fallback if the data cannot be read locally.
git_command = "git clone --single-branch --depth=1"
git_repo_location = "https://github.com/DeltaAnalytics/machine_learning_for_good_data ../data"
install_from_repo = git_command + ' ' + git_repo_location

# Try to read the zipped CSV from the local data directory. If that
# fails, clone the data repository into ../data.
try:
    df = pd.read_csv(f"{path}{fileName}.zip")
except Exception:
    os.system(install_from_repo)

# Read the data again. If the zip archive cannot be read (for example,
# because it has already been extracted), fall back to the plain CSV file.
try:
    df = pd.read_csv(f"{path}{fileName}.zip")
except Exception:
    df = pd.read_csv(f"{path}{fileName}.csv")
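
If the data files are not available, `read_csv` can still be tried out on a small in-memory CSV. The sketch below assumes nothing about the real dataset: the text in `csv_text` and the name `toy_df` are made up for illustration.

```python
import io

import pandas as pd

# Toy CSV text standing in for a file on disk (values are invented).
csv_text = """activity,loan_amount,status
Retail,300,funded
Farming,625,expired
Food,1600,fundraising
"""

# read_csv accepts a path or a file-like object, so StringIO works here.
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
```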

Once we have loaded the CSV as a DataFrame, we can start to explore
the data. Here are a few useful methods and attributes:

- `.head()`: returns the first 5 rows of the DataFrame (by default)
- `.tail()`: returns the last 5 rows of the DataFrame (by default)
- `.shape`: returns a tuple with the first element indicating the number of rows and the second element indicating the number of columns
- `.columns`: returns an Index of all column labels in the DataFrame
- `.index`: returns the DataFrame's row labels
- `.dtypes`: returns a Series giving the data type of each column
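
The items in the list above can be tried on a small frame before running them on the full dataset. This is only a sketch on invented values; `toy_df` is a hypothetical name.

```python
import pandas as pd

# A small invented frame with 7 rows and 2 columns.
toy_df = pd.DataFrame(
    {
        "activity": ["Retail", "Farming", "Food", "Retail", "Food", "Retail", "Farming"],
        "loan_amount": [300, 625, 1600, 450, 200, 1000, 75],
    }
)

print(toy_df.head(2))        # head and tail accept the number of rows to show
print(toy_df.tail(2))
print(toy_df.shape)          # attribute, so no parentheses
print(list(toy_df.columns))  # column labels converted to a plain list
print(toy_df.index)          # row labels
print(toy_df.dtypes)         # one data type per column
```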

In [3]:
df.shape

(5623, 52)

The `.shape` attribute shows the dimensions of your dataset: here, 5623 rows and 52 columns.

In [4]:
df.dtypes

activity                                                 object
basket_amount                                           float64
bonus_credit_eligibility                                   bool
borrower_count                                            int64
borrowers                                                object
currency_exchange_loss_amount                           float64
description_languages                                    object
description_texts_en                                     object
description_texts_es                                     object
description_texts_fr                                     object
description_texts_pt                                     object
funded_amount                                             int64
funded_date                                              object
id                                                        int64
image_id                                                  int64
image_template_id                       

To get some basic statistics for the columns, you can use `.describe()` for numeric data or `.value_counts()` for categorical data.

In [5]:
df.describe()

Unnamed: 0,basket_amount,borrower_count,currency_exchange_loss_amount,funded_amount,id,image_id,image_template_id,journal_totals_bulkEntries,journal_totals_entries,lender_count,loan_amount,partner_id,terms_disbursal_amount,terms_loan_amount,terms_loss_liability_currency_exchange_coverage_rate,terms_repayment_term,translator_image,video_id,video_thumbnailImageId
count,716.0,5623.0,216.0,5623.0,5623.0,5623.0,5623.0,5623.0,5623.0,5623.0,5623.0,5623.0,5623.0,5623.0,5449.0,5623.0,2690.0,3.0,3.0
mean,5.342179,4.601103,37.109352,1312.177663,1386572.0,2544438.0,1.0,0.0,0.0,34.835319,1462.266584,266.030944,896414.2,1462.266584,0.1,11.947359,1658052.0,2833.0,1342795.0
std,35.874513,6.744383,62.173902,2773.545206,402558.8,760916.7,0.0,0.0,0.0,84.17451,2951.94445,148.667976,1760933.0,2951.94445,2.775812e-17,9.09144,900967.5,112.663215,205054.4
min,0.0,1.0,0.09,0.0,13772.0,47844.0,1.0,0.0,0.0,0.0,50.0,23.0,100.0,50.0,0.1,4.0,28733.0,2724.0,1183944.0
25%,0.0,1.0,3.5425,200.0,1280366.0,2451692.0,1.0,0.0,0.0,6.0,300.0,160.0,7980.0,300.0,0.1,8.0,839627.0,2775.0,1227052.0
50%,0.0,1.0,9.125,500.0,1599478.0,2935550.0,1.0,0.0,0.0,16.0,625.0,222.0,110000.0,625.0,0.1,10.0,1566276.0,2826.0,1270161.0
75%,0.0,5.0,31.375,1450.0,1621750.0,2966605.0,1.0,0.0,0.0,39.0,1600.0,422.0,1200000.0,1600.0,0.1,14.0,2597284.0,2887.5,1422220.0
max,600.0,43.0,302.39,80000.0,1629761.0,2978281.0,1.0,0.0,0.0,2665.0,100000.0,587.0,38000000.0,100000.0,0.1,133.0,2979421.0,2949.0,1574280.0


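The following counts the missing values in each column with `df.isnull().sum()`, then uses `.value_counts()` to show how many columns share each missing-value count. Here, 34 columns have no missing values.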

In [6]:
df.isnull().sum().value_counts()

0       34
5620     3
342      2
765      1
2170     1
5622     1
2933     1
5423     1
174      1
4907     1
2006     1
5407     1
152      1
21       1
4042     1
834      1
dtype: int64
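
To see which column each count belongs to, it can help to look at the per-column totals before calling `.value_counts()`. The sketch below uses an invented frame (`toy_df` is a hypothetical name), not the loans dataset.

```python
import numpy as np
import pandas as pd

# Invented frame with some missing entries.
toy_df = pd.DataFrame(
    {
        "activity": ["Retail", "Farming", None, "Food"],
        "basket_amount": [np.nan, 50.0, np.nan, np.nan],
        "loan_amount": [300, 625, 1600, 450],
    }
)

# Number of missing values in each column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Fraction of missing values in each column.
print(toy_df.isnull().mean())

# How many columns share each missing-value count.
print(missing_per_column.value_counts())
```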

Alternatively, if you want just the count, min / max, or mean of one column, you can use Python's built-in functions (such as `len` and `max`) or the Series' own methods (such as `.mean()`):

In [7]:
print('Feature Length: ', len(df['borrower_count']))
print('Max Funded Amount: ', max(df['funded_amount']))
print('Average Loan Amount: ', df['loan_amount'].mean())

Feature Length:  5623
Max Funded Amount:  80000
Average Loan Amount:  1462.2665836741953
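
One detail worth knowing when a column has missing values: `len` counts every row, while Series methods such as `.count()`, `.max()`, and `.mean()` skip missing values by default. A minimal sketch on invented numbers (the name `amounts` is hypothetical):

```python
import numpy as np
import pandas as pd

# Invented column with one missing value.
amounts = pd.Series([50.0, np.nan, 625.0, 1600.0], name="basket_amount")

print(len(amounts))      # counts all rows, including the missing one
print(amounts.count())   # counts only non-missing values
print(amounts.min())     # missing values are skipped
print(amounts.max())
print(amounts.mean())    # mean of the three non-missing values
```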


And if you want a quick preview, you can use the `.head()` method, which shows the first 5 rows of the dataset by default.

In [8]:
df.head()

Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrower_count,borrowers,currency_exchange_loss_amount,description_languages,description_texts_en,description_texts_es,description_texts_fr,...,terms_repayment_interval,terms_repayment_term,terms_scheduled_payments,themes,translator_byline,translator_image,use,video_id,video_thumbnailImageId,video_title
0,Food Stall,,False,18,"[{'first_name': 'Seconde', 'last_name': '', 'g...",,"['fr', 'en']",Marie is a member of the TujimbereI group and ...,,Marie fait partie du groupe TujimbereI et habi...,...,Monthly,8,[],['Conflict Zones'],Marie Mintalucci,,to increase her capital and buy cassava and ba...,,,
1,Retail,,False,13,"[{'first_name': 'J Pierre', 'last_name': '', '...",,"['fr', 'en']",Evelyne is a member of the Ntitugungane grou...,,Evelyne fait partie du groupe Ntitugungane et...,...,Monthly,10,[],['Conflict Zones'],Katharina S,340594.0,"to buy bananas for wine production, increasing...",,,
2,Butcher Shop,,False,17,"[{'first_name': 'Léopold', 'last_name': '', 'g...",,"['fr', 'en']",Léopold is part of the group called Umugogo an...,,Léopold fait partie du groupe Umugogo et habi...,...,Monthly,8,[],['Conflict Zones'],Daniel Kuey,2835726.0,to purchase goats to resell for meat and earn ...,,,
3,Food,,False,30,"[{'first_name': 'Joseph', 'last_name': '', 'ge...",,"['fr', 'en']",Joseph belongs to the group Kanga and lives in...,,Joseph fait partie du groupe Kanga et habite ...,...,Monthly,8,[],['Conflict Zones'],melanie fluharty,2286027.0,to increase his capital and to purchase palm t...,,,
4,Food,,False,26,"[{'first_name': 'Virginie', 'last_name': '', '...",,"['fr', 'en']",Virginie is a member of the Banguka I group an...,,Virginie fait partie du groupe Banguka I et h...,...,Monthly,8,[],['Conflict Zones'],,,to buy bananas for resale.,,,


Lastly, if you want to see how often each value occurs in a column (Series), you can use the Pandas `value_counts` method.

In [9]:
df['activity'].value_counts()

Retail              522
Farming             500
General Store       221
Food                206
Clothing Sales      201
                   ... 
Rickshaw              1
Technology            1
Event Planning        1
Waste Management      1
Machinery Rental      1
Name: activity, Length: 134, dtype: int64
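
`value_counts` also takes a few optional parameters. The sketch below, on an invented Series named `activity`, shows `normalize=True` for proportions and `dropna=False` to include missing values in the tally.

```python
import pandas as pd

# Invented column with one missing entry.
activity = pd.Series(["Retail", "Farming", "Retail", "Food", "Retail", None])

counts = activity.value_counts()                   # missing values are excluded
shares = activity.value_counts(normalize=True)     # proportions instead of counts
with_missing = activity.value_counts(dropna=False) # missing values get their own row

print(counts)
print(shares)
print(with_missing)
```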

### 3. Selecting Data
To examine specific columns of the DataFrame, select them by name. The examples below show the beginning of one column with `.head()` and the end of two columns with `.tail()`:

In [10]:
df['activity'].head()

0      Food Stall
1          Retail
2    Butcher Shop
3            Food
4            Food
Name: activity, dtype: object

In [11]:
df[['activity','basket_amount']].tail()

Unnamed: 0,activity,basket_amount
5618,Shoe Sales,
5619,Retail,
5620,Livestock,
5621,Retail,
5622,Retail,




To examine specific rows and columns of a DataFrame, Pandas provides
the `iloc` and `loc` indexers. `.iloc` is used when you want to specify a list or range of integer positions, and `.loc` is used when you want to specify a list or range of labels.

For both of these indexers you specify two elements inside square brackets, with the first element indicating the rows that you want to select and the second element indicating the columns that you want to select.

In [12]:
# Get the rows at positions 1 and 2 and the columns at positions 0 through 4.
df.iloc[1:3,:5]

Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrower_count,borrowers
1,Retail,,False,13,"[{'first_name': 'J Pierre', 'last_name': '', '..."
2,Butcher Shop,,False,17,"[{'first_name': 'Léopold', 'last_name': '', 'g..."


In [13]:
# Get rows with index values of 2-4 and the columns basket_amount and activity
df.loc[2:4, ["basket_amount", "activity"]]

Unnamed: 0,basket_amount,activity
2,,Butcher Shop
3,,Food
4,,Food


What do you notice about the way the indices work for `iloc` versus `loc`?
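
One way to explore this question is with a small frame whose row labels differ from the row positions. The sketch below uses invented data; `toy_df`, `by_position`, and `by_label` are hypothetical names.

```python
import pandas as pd

# Row labels start at 10, so labels and positions no longer coincide.
toy_df = pd.DataFrame(
    {
        "activity": ["Retail", "Farming", "Food", "Bakery", "Weaving"],
        "loan_amount": [300, 625, 1600, 450, 200],
    },
    index=[10, 11, 12, 13, 14],
)

# iloc slices by position; the end position is excluded.
by_position = toy_df.iloc[1:3]

# loc slices by label; the end label is included.
by_label = toy_df.loc[11:13]

print(by_position)
print(by_label)
```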

In [14]:
# To see all the rows and columns:
# Note: [remove the .head() to see it all]
df.iloc[:,:].head()


Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrower_count,borrowers,currency_exchange_loss_amount,description_languages,description_texts_en,description_texts_es,description_texts_fr,...,terms_repayment_interval,terms_repayment_term,terms_scheduled_payments,themes,translator_byline,translator_image,use,video_id,video_thumbnailImageId,video_title
0,Food Stall,,False,18,"[{'first_name': 'Seconde', 'last_name': '', 'g...",,"['fr', 'en']",Marie is a member of the TujimbereI group and ...,,Marie fait partie du groupe TujimbereI et habi...,...,Monthly,8,[],['Conflict Zones'],Marie Mintalucci,,to increase her capital and buy cassava and ba...,,,
1,Retail,,False,13,"[{'first_name': 'J Pierre', 'last_name': '', '...",,"['fr', 'en']",Evelyne is a member of the Ntitugungane grou...,,Evelyne fait partie du groupe Ntitugungane et...,...,Monthly,10,[],['Conflict Zones'],Katharina S,340594.0,"to buy bananas for wine production, increasing...",,,
2,Butcher Shop,,False,17,"[{'first_name': 'Léopold', 'last_name': '', 'g...",,"['fr', 'en']",Léopold is part of the group called Umugogo an...,,Léopold fait partie du groupe Umugogo et habi...,...,Monthly,8,[],['Conflict Zones'],Daniel Kuey,2835726.0,to purchase goats to resell for meat and earn ...,,,
3,Food,,False,30,"[{'first_name': 'Joseph', 'last_name': '', 'ge...",,"['fr', 'en']",Joseph belongs to the group Kanga and lives in...,,Joseph fait partie du groupe Kanga et habite ...,...,Monthly,8,[],['Conflict Zones'],melanie fluharty,2286027.0,to increase his capital and to purchase palm t...,,,
4,Food,,False,26,"[{'first_name': 'Virginie', 'last_name': '', '...",,"['fr', 'en']",Virginie is a member of the Banguka I group an...,,Virginie fait partie du groupe Banguka I et h...,...,Monthly,8,[],['Conflict Zones'],,,to buy bananas for resale.,,,


In [15]:
# You can also store a selection from the DataFrame in a new variable.
# Selecting a single column returns a Series.
eligibility = df.iloc[:,2]
eligibility.head()

0    False
1    False
2    False
3    False
4    False
Name: bonus_credit_eligibility, dtype: bool
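
Note that the output above is a Series, because a single column position was selected. Passing a list of positions instead keeps the result as a DataFrame. A minimal sketch on invented data (`toy_df`, `as_series`, and `as_frame` are hypothetical names):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "activity": ["Retail", "Farming", "Food"],
        "basket_amount": [0.0, 50.0, 25.0],
        "bonus_credit_eligibility": [False, True, False],
    }
)

as_series = toy_df.iloc[:, 2]    # single position: returns a Series
as_frame = toy_df.iloc[:, [2]]   # list of positions: returns a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```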

### 4. Selecting Subsets of the DataFrame

A powerful feature of DataFrames is that you can view a subset of the DataFrame based on the values in its columns.

For example, let's say you only wanted to view loans with a status of "expired":

In [16]:
df[df['status']=='expired'].head()

Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrower_count,borrowers,currency_exchange_loss_amount,description_languages,description_texts_en,description_texts_es,description_texts_fr,...,terms_repayment_interval,terms_repayment_term,terms_scheduled_payments,themes,translator_byline,translator_image,use,video_id,video_thumbnailImageId,video_title
18,Food,,False,24,"[{'first_name': 'Tharcisse', 'last_name': '', ...",,"['fr', 'en']",Evariste is part of the Mageyo group and lives...,,Evariste fait partie du groupe Mageyo et habi...,...,Monthly,10,[],['Conflict Zones'],leonardo,1277720.0,to increase his capital and purchase a large q...,,,
20,General Store,,False,18,"[{'first_name': 'Emmanuel', 'last_name': '', '...",,"['fr', 'en']",Egide is part of the Yagurukundo group and liv...,,Egide fait partie du groupe Yagurukundo et h...,...,Monthly,9,[],['Conflict Zones'],leonardo,1277720.0,"to increase their capital and buy rice, beans,...",,,
21,Food,,False,26,"[{'first_name': 'Elie', 'last_name': '', 'gend...",,"['fr', 'en']",Isidore is a member of the Butanuka group and ...,,Isidore fait partie du groupe Butanuka et hab...,...,Monthly,9,[],['Conflict Zones'],Katharina S,340594.0,to buy palm oil for resale in order to earn more.,,,
24,Clothing Sales,,False,26,"[{'first_name': 'Aline', 'last_name': '', 'gen...",,"['fr', 'en']",Alexis is part of the Gitwe-Twitezimbere group...,,Alexis fait partie du groupe Gitwe-Twitezimber...,...,Monthly,10,[],['Conflict Zones'],leonardo,1277720.0,to increase his capital and buy clothing to re...,,,
25,Butcher Shop,,False,26,"[{'first_name': 'Spéciose', 'last_name': '', '...",,"['fr', 'en']",Marcien is a member of the group called Gitwe-...,,Marcien fait partie du groupe Gitwe-Twitezimbe...,...,Monthly,10,[],['Conflict Zones'],Teresa Kramer,1940938.0,to grow his working capital and buy a pig to s...,,,


To view all loans with a status of "expired" **or** "fundraising", we use the pipe character `|`. Each condition must be wrapped in parentheses:

In [17]:
df[(df['status']=='expired')|(df['status']=='fundraising')]

Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrower_count,borrowers,currency_exchange_loss_amount,description_languages,description_texts_en,description_texts_es,description_texts_fr,...,terms_repayment_interval,terms_repayment_term,terms_scheduled_payments,themes,translator_byline,translator_image,use,video_id,video_thumbnailImageId,video_title
18,Food,,False,24,"[{'first_name': 'Tharcisse', 'last_name': '', ...",,"['fr', 'en']",Evariste is part of the Mageyo group and lives...,,Evariste fait partie du groupe Mageyo et habi...,...,Monthly,10,[],['Conflict Zones'],leonardo,1277720.0,to increase his capital and purchase a large q...,,,
20,General Store,,False,18,"[{'first_name': 'Emmanuel', 'last_name': '', '...",,"['fr', 'en']",Egide is part of the Yagurukundo group and liv...,,Egide fait partie du groupe Yagurukundo et h...,...,Monthly,9,[],['Conflict Zones'],leonardo,1277720.0,"to increase their capital and buy rice, beans,...",,,
21,Food,,False,26,"[{'first_name': 'Elie', 'last_name': '', 'gend...",,"['fr', 'en']",Isidore is a member of the Butanuka group and ...,,Isidore fait partie du groupe Butanuka et hab...,...,Monthly,9,[],['Conflict Zones'],Katharina S,340594.0,to buy palm oil for resale in order to earn more.,,,
24,Clothing Sales,,False,26,"[{'first_name': 'Aline', 'last_name': '', 'gen...",,"['fr', 'en']",Alexis is part of the Gitwe-Twitezimbere group...,,Alexis fait partie du groupe Gitwe-Twitezimber...,...,Monthly,10,[],['Conflict Zones'],leonardo,1277720.0,to increase his capital and buy clothing to re...,,,
25,Butcher Shop,,False,26,"[{'first_name': 'Spéciose', 'last_name': '', '...",,"['fr', 'en']",Marcien is a member of the group called Gitwe-...,,Marcien fait partie du groupe Gitwe-Twitezimbe...,...,Monthly,10,[],['Conflict Zones'],Teresa Kramer,1940938.0,to grow his working capital and buy a pig to s...,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5458,Retail,50.0,True,14,"[{'first_name': 'Ansarou ', 'last_name': '', '...",,"['fr', 'en']",The banc villageois of which this group of 14 ...,,Le banc villageois dont fait partie ce groupe ...,...,At end of term,8,"[{'due_date': '2018-12-01T08:00:00Z', 'amount'...",,Jennifer Anderson,2118763.0,"to buy 5 sacks of cabbages, 1 basket of smoked...",,,
5462,Retail,0.0,True,5,"[{'first_name': 'Mame Bousso', 'last_name': ''...",,"['fr', 'en']",This group is made up of five women who share ...,,Ce groupe est composé de 05 femmes qui partage...,...,Monthly,12,"[{'due_date': '2018-12-01T08:00:00Z', 'amount'...",['Rural Exclusion'],Joanne Assheton,1046536.0,to buy vegetables and fish.,,,
5472,Livestock,0.0,True,10,"[{'first_name': 'Coumba', 'last_name': '', 'ge...",,"['fr', 'en']","This group is made up of ten women, who share ...",,Ce groupe est composé de 10 femmes qui partage...,...,Irregularly,12,"[{'due_date': '2019-04-01T07:00:00Z', 'amount'...",['Rural Exclusion'],Alison Le Bras,608879.0,to buy sheep.,,,
5474,Fish Selling,0.0,True,2,"[{'first_name': 'Nady', 'last_name': '', 'gend...",,"['fr', 'en']",This two-woman group was created in December 2...,,Ce groupe de 2 femmes a été crée en décembre 2...,...,Irregularly,14,"[{'due_date': '2018-12-01T08:00:00Z', 'amount'...",,Mary Lou Bradley,2317688.0,to buy fish to sell.,,,


To select loans that have expired **and** have loan amounts greater than 1000, we use the ampersand character `&`:

In [18]:
df[(df['status']=='expired')&(df['loan_amount']>1000)]

Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrower_count,borrowers,currency_exchange_loss_amount,description_languages,description_texts_en,description_texts_es,description_texts_fr,...,terms_repayment_interval,terms_repayment_term,terms_scheduled_payments,themes,translator_byline,translator_image,use,video_id,video_thumbnailImageId,video_title
18,Food,,False,24,"[{'first_name': 'Tharcisse', 'last_name': '', ...",,"['fr', 'en']",Evariste is part of the Mageyo group and lives...,,Evariste fait partie du groupe Mageyo et habi...,...,Monthly,10,[],['Conflict Zones'],leonardo,1277720.0,to increase his capital and purchase a large q...,,,
20,General Store,,False,18,"[{'first_name': 'Emmanuel', 'last_name': '', '...",,"['fr', 'en']",Egide is part of the Yagurukundo group and liv...,,Egide fait partie du groupe Yagurukundo et h...,...,Monthly,9,[],['Conflict Zones'],leonardo,1277720.0,"to increase their capital and buy rice, beans,...",,,
21,Food,,False,26,"[{'first_name': 'Elie', 'last_name': '', 'gend...",,"['fr', 'en']",Isidore is a member of the Butanuka group and ...,,Isidore fait partie du groupe Butanuka et hab...,...,Monthly,9,[],['Conflict Zones'],Katharina S,340594.0,to buy palm oil for resale in order to earn more.,,,
24,Clothing Sales,,False,26,"[{'first_name': 'Aline', 'last_name': '', 'gen...",,"['fr', 'en']",Alexis is part of the Gitwe-Twitezimbere group...,,Alexis fait partie du groupe Gitwe-Twitezimber...,...,Monthly,10,[],['Conflict Zones'],leonardo,1277720.0,to increase his capital and buy clothing to re...,,,
25,Butcher Shop,,False,26,"[{'first_name': 'Spéciose', 'last_name': '', '...",,"['fr', 'en']",Marcien is a member of the group called Gitwe-...,,Marcien fait partie du groupe Gitwe-Twitezimbe...,...,Monthly,10,[],['Conflict Zones'],Teresa Kramer,1940938.0,to grow his working capital and buy a pig to s...,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3595,Animal Sales,,True,6,"[{'first_name': 'Dramane', 'last_name': '', 'g...",,"['fr', 'en']","Malamine is seen in the photo, holding the she...",,Sur la photo Malamine est celui qui tient son...,...,At end of term,10,[],"['Underfunded Areas', 'Rural Exclusion']",Sarah Ryder,1886035.0,to buy sheep for resale so he can build a house.,,,
3596,Livestock,,True,7,"[{'first_name': 'Aly', 'last_name': '', 'gende...",,"['fr', 'en']","Saly is thirty-nine years old, married and fat...",,Saly est un homme marié et père de 04 enfants...,...,At end of term,12,[],"['Underfunded Areas', 'Rural Exclusion']",Sarah Ryder,1886035.0,to buy livestock for resale.,,,
4584,Services,,True,1,"[{'first_name': 'Phatwell', 'last_name': '', '...",,['en'],Phatwell prides himself on the impact that he ...,,,...,At end of term,10,[],"['Mobile Technology', 'Start-Up', 'Job Creatio...",,,to increase his working capital and ensure tha...,,,
5340,Home Products Sales,,False,1,"[{'first_name': 'Richard ', 'last_name': '', '...",,['en'],Richard lives outside of Koforidua in the East...,,,...,Monthly,13,[],"['Underfunded Areas', 'Rural Exclusion']",,,to buy curtains and other interior decors for ...,,,
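
The filters above can also be written with named boolean masks, which some readers find easier to follow. The sketch below runs on invented data and additionally shows `.isin()` as an alternative to chaining `|`, and `~` for negation. Because `&` and `|` bind more tightly than `==` and `>` in Python, conditions written inline need parentheses; named masks avoid that issue.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "status": ["expired", "funded", "fundraising", "expired"],
        "loan_amount": [1500, 300, 2000, 500],
    }
)

# Each comparison produces a boolean Series with one entry per row.
is_expired = toy_df["status"] == "expired"
is_large = toy_df["loan_amount"] > 1000

expired_and_large = toy_df[is_expired & is_large]
expired_or_fundraising = toy_df[toy_df["status"].isin(["expired", "fundraising"])]
not_expired = toy_df[~is_expired]

print(expired_and_large)
print(expired_or_fundraising)
print(not_expired)
```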


### 5. Merging and grouping data

You can group data by a column that contains repeated values, such as `activity`, and then aggregate another column within each group. Here we sum `loan_amount` for each activity.

In [19]:
df.groupby(['activity'])['loan_amount'].sum().reset_index()

Unnamed: 0,activity,loan_amount
0,Agriculture,273925
1,Animal Sales,380475
2,Arts,1450
3,Auto Repair,1425
4,Bakery,65300
...,...,...
129,Waste Management,425
130,Water Distribution,71300
131,Weaving,2800
132,Wedding Expenses,1175
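
A grouped column can be summarized with several aggregations at once using `.agg()`. This is a sketch on invented data; `toy_df` and `summary` are hypothetical names.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "activity": ["Retail", "Farming", "Retail", "Food", "Farming"],
        "loan_amount": [300, 625, 450, 1600, 75],
    }
)

# Sum, mean, and count of loan_amount within each activity.
summary = (
    toy_df.groupby("activity")["loan_amount"]
    .agg(["sum", "mean", "count"])
    .reset_index()
)
print(summary)
```

By default the groups appear sorted by the grouping key, and `reset_index()` turns that key back into an ordinary column.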


You can also perform SQL-style joins (inner, outer, left, and right) using `pd.merge()`. Find documentation on this concept here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
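
As a rough sketch of how the join types differ, the example below merges two small invented tables on a shared key. The `partners` table and its `partner_name` column are assumptions made for illustration; only a `partner_id` column is known to exist in the loans data.

```python
import pandas as pd

loans = pd.DataFrame(
    {"partner_id": [23, 160, 999], "loan_amount": [300, 625, 1600]}
)
# Hypothetical lookup table; names are invented.
partners = pd.DataFrame(
    {"partner_id": [23, 160, 222], "partner_name": ["Partner A", "Partner B", "Partner C"]}
)

# inner: keep only keys present in both tables.
inner = pd.merge(loans, partners, on="partner_id", how="inner")
# left: keep every row of loans; unmatched rows get missing values.
left = pd.merge(loans, partners, on="partner_id", how="left")
# outer: keep keys from either table.
outer = pd.merge(loans, partners, on="partner_id", how="outer")

print(inner)
print(left)
print(outer)
```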

## Great Resources for further information:

- [10 minute introduction to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
- [Pandas in ipython notebooks](http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/master/cookbook/A%20quick%20tour%20of%20IPython%20Notebook.ipynb)
- [Read CSV, ZIP, JSON, or more from Pandas Library](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [20]:
!ls 

1_1 intro_to_python.ipynb
1_2_intro_to_numpy.ipynb
1_3_intro_to_pandas.ipynb
1_4_loading_and_understanding_data.ipynb
1_5_exploratory_data_analysis.ipynb
README.md
best_practices_data_science.pdf
images
intro_to_visualization.pptx
python_installation_instructions.md


In [21]:
## Example of installing a library via a Jupyter Notebook.
# !pip install "name of the library missing"