### What is Pandas?

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term **"panel data"**, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Wes McKinney started working on pandas in 2008 while at AQR Capital Management, out of the need for a high-performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR, he convinced management to allow him to open-source the library.

Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor to the library. 

It is written in Python, Cython, and C, and the initial release was on 11 January 2008.

Wes McKinney has been named **Benevolent dictator for life (BDFL)**, a title given to a small number of open-source software development leaders, typically project founders who retain the final say in disputes or arguments within the community.

![Wes McKinney](https://budapestbiforum.hu/2015/en/wp-content/uploads/sites/3/2015/09/eloadok-slide-1en.jpg)

### We are now going to import the pandas library and load our data


In [None]:
import pandas as pd

### Series
A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
The simplest Series is formed from only an array of data.

In [None]:
obj = pd.Series([4, 7, -5, 3])
obj
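As a small sketch of the two parts described above, the values and the index of a Series can be inspected separately. The names `values` and `labels` are only illustrative; because no index is passed, pandas supplies a default integer index.

```python
import pandas as pd

# Same data as the cell above; no index is given, so pandas
# creates a default integer index running from 0 to 3.
obj = pd.Series([4, 7, -5, 3])

values = obj.values        # the underlying NumPy array of data
labels = list(obj.index)   # the default labels as a plain list

print(values)
print(labels)
```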

Often it is desirable to create a Series with an index that identifies each data point
with a label.

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
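Once a Series has labels, those labels can be used to pick out values. The following is a minimal sketch; the names `single`, `several`, and `positive` are just illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

single = obj2['a']               # one label returns a single value
several = obj2[['c', 'a', 'd']]  # a list of labels returns a new Series
positive = obj2[obj2 > 0]        # a Boolean mask keeps only matching entries

print(single)
print(several)
print(positive)
```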

We can also store the index labels in a list first and then pass that list to our Series.

In [None]:
ind = ['aleph', 'beth', 'gimel', 'dalet']

In [None]:
obj3 = pd.Series([-13, 0, 34, 55], index = ind)
obj3

Next, let us create a Python dictionary that we will convert into a pandas Series.

In [None]:
reg_voters = {'North Central': 10586965, 'North East': 9929015, 'North West': 18505984,
              'South East': 8293093, 'South South' : 11101093, 'South West' : 14626800}
reg_voters

We now convert the dictionary 'reg_voters' into a pandas Series. The dictionary keys become the index labels.

In [None]:
obj4 = pd.Series(reg_voters)
obj4

In [None]:
type(obj4)
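When building a Series from a dictionary, an explicit index can also be passed to choose and order the keys. The sketch below assumes a made-up label, 'Unknown Zone', that is not in the dictionary, to show that pandas fills the missing entry with NaN; `wanted` and `obj5` are illustrative names.

```python
import pandas as pd

reg_voters = {'North Central': 10586965, 'North East': 9929015, 'North West': 18505984,
              'South East': 8293093, 'South South': 11101093, 'South West': 14626800}

# 'Unknown Zone' is a made-up label that is not a key in the dictionary
wanted = ['South West', 'North West', 'Unknown Zone']
obj5 = pd.Series(reg_voters, index=wanted)

print(obj5)
print(obj5.isna())  # True only for the label with no matching key
```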

### DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index; it can be thought of
as a dictionary of Series all sharing the same index.

We will create a dictionary for the minimum and maximum temperatures for the city of Abuja across the months.

In [None]:
data = {'Month' : ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
       'Min Temp (C)' : [20, 26, 24, 25, 20, 18, 22, 18,18, 21, 16, 16],
       'Max Temp (C)' : [35, 37, 37, 36, 33, 31,29, 29, 30, 32, 34, 35]}

data

We now convert the created dictionary into a DataFrame

In [None]:
df = pd.DataFrame(data)
df
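To illustrate the idea of a DataFrame as a dictionary of Series, the sketch below uses only the first three months of the data above. The name `df_small` and the derived column 'Range (C)' are illustrative additions, not part of the original notebook.

```python
import pandas as pd

data = {'Month': ['Jan', 'Feb', 'Mar'],
        'Min Temp (C)': [20, 26, 24],
        'Max Temp (C)': [35, 37, 37]}
df_small = pd.DataFrame(data)

# Selecting one column returns a Series that shares the DataFrame's index
max_temp = df_small['Max Temp (C)']
print(type(max_temp))

# Columns line up on the shared index, so arithmetic works row by row
df_small['Range (C)'] = df_small['Max Temp (C)'] - df_small['Min Temp (C)']
print(df_small)
```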

We can also create a list of index labels to use in our DataFrame.

In [None]:
my_index=['alif', 'baa', 'taa', 'thaa', 'jiim', 'haa', 'kha', 'daal', 'thaal', 'raa', 'zaay', 'siin']


In [None]:
df1 = pd.DataFrame(data, index=my_index)
df1

We can also choose to use a column as our index, for instance the 'Month' column in this case.

In [None]:
df.set_index('Month', inplace=True)
df
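With `inplace=True` the original DataFrame is modified directly. As a sketch of the alternative, `set_index` without `inplace` returns a new DataFrame and leaves the original untouched, and `reset_index` moves the index back into a column. The names `df_small`, `by_month`, and `restored` are illustrative.

```python
import pandas as pd

df_small = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'],
                         'Min Temp (C)': [20, 26, 24],
                         'Max Temp (C)': [35, 37, 37]})

by_month = df_small.set_index('Month')  # new DataFrame; df_small is unchanged
restored = by_month.reset_index()       # 'Month' becomes an ordinary column again

print(by_month)
print(restored)
```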

### Loading data from a file

Our file is a tab-separated values (TSV) file rather than a comma-separated values (CSV) file. Because the comma is the default separator, we need to set the 'delimiter' parameter to a tab character.

We now read the file and store the result in a variable.

In [None]:
# load data
gap = pd.read_csv('gapminder.tsv', delimiter='\t')
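If the file is not at hand, the effect of the separator can be tried on a small in-memory example. The text below is placeholder data invented for this sketch (it is not taken from the Gapminder file), and `sep` is the parameter for which `delimiter` is an alias.

```python
import io
import pandas as pd

# Placeholder tab-separated text, invented for illustration only
tsv_text = "name\tyear\tvalue\nX\t2000\t1.0\nY\t2001\t2.0\n"

# io.StringIO lets read_csv treat the string as if it were a file
sample = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(sample)
print(sample.shape)
```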


We can view the loaded data in several different ways.

In [None]:
# view the first 5 entries
gap.head()

In [None]:
# view the last 5 entries
gap.tail()

In [None]:
# view the whole DataFrame at once (pandas truncates long output)
gap

In [None]:
# to see the total number of entries or rows
print(len(gap))

We may want to display all the entries without truncation. We can write a function to help us do so.

In [None]:
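from IPython.display import display  # built into Jupyter; imported here to be explicit

def display_all(df):
    # temporarily raise the display limits so that no rows or columns are truncated
    with pd.option_context("display.max_rows", 14000, "display.max_columns", 1000):
        display(df)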


In [None]:
display_all(gap)
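The reason `display_all` uses `pd.option_context` is that the changed settings apply only inside the `with` block. A small sketch to confirm this behaviour follows; the variable names are illustrative.

```python
import pandas as pd

before = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", 14000):
    inside = pd.get_option("display.max_rows")  # the temporary setting

after = pd.get_option("display.max_rows")       # restored automatically

print(before, inside, after)
```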

In [None]:
type(gap)

In [None]:
gap.shape

In [None]:
gap.info()

In [None]:
gap.head(0)

In [None]:
country = gap['country']
country.head()

In [None]:
continent = gap['continent']
continent.head()

In [None]:
gap[["country","continent", "year", "lifeExp"]].head()

In [None]:
# use the attribute 'columns' to get the column names or headers
gap.columns

In [None]:
# delete the 'country' column
del gap['country']

In [None]:
gap.head()

In [None]:
# we can also delete a column using the 'drop' method
gap = gap.drop('continent', axis=1)  # the default axis is 0 (rows), so we set it to 1 for columns
gap.head()

In [None]:
# we can drop columns without reassignment by using 'inplace'
gap.drop('pop', axis=1, inplace=True)
gap.head()
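Several columns can also be dropped in one call. The sketch below uses a small made-up DataFrame named `toy` and the `columns=` keyword, which is an alternative to passing `axis=1`.

```python
import pandas as pd

# A tiny made-up DataFrame for illustration
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Without inplace=True, drop returns a new DataFrame and leaves 'toy' unchanged
trimmed = toy.drop(columns=['a', 'c'])

print(trimmed)
print(toy.columns)
```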

We can also load a DataFrame from a URL. In this case, we want to see drinks by country.

In [None]:
path = 'http://bit.ly/drinksbycountry' 

Introducing **f-string** formatting.

In [None]:
drinks = pd.read_csv(f'{path}')

Also called **“formatted string literals,”** f-strings are string literals that have an f at the beginning and curly braces containing expressions that will be replaced with their values (in our case above, **f'{path}'**).  You can go to this [link](https://realpython.com/python-f-strings/) to read more.
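An f-string becomes more useful when it combines several values or expressions. In the sketch below, the names `base`, `name`, and `summary` are illustrative, and the URL is simply rebuilt from two parts.

```python
base = 'http://bit.ly'
name = 'drinksbycountry'

# Each expression inside curly braces is replaced by its value
path = f'{base}/{name}'

rows = 3
summary = f'{rows} rows and {rows * 2} cells'  # expressions are evaluated too

print(path)
print(summary)
```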

In [None]:
drinks.head()

### Homework
Using the new 'DrinksByCountry' data, repeat the same manipulations we performed on the 'GapMinder' data.

### To subset rows using 'loc' and 'iloc'

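Earlier we deleted several columns from the 'gap' dataset, so it needs to be reloaded before we use it again (we do this in the Boolean subsetting section). For now, we practise on the temperature DataFrame 'df', which uses the month as its index.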


In [None]:
df

In [None]:
# 'loc'
df.loc['Feb']

In [None]:
type(df.loc['Feb'])

In [None]:
type(df)

In [None]:
df.loc['Dec']

In [None]:
# 'iloc'
df.iloc[0]

In [None]:
df.iloc[-1]

In [None]:
df.loc[['Jan','Mar','Aug']]

In [None]:
df.loc[['Jan','Mar','Aug'],'Min Temp (C)']

In [None]:
df.iloc[[0, 4, 9],1]

'loc' selects rows and columns by their labels, while 'iloc' selects rows and columns by their integer positions (starting from 0).
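One difference worth seeing in action is how the two handle slices: a label slice with 'loc' includes the end label, while a position slice with 'iloc' excludes the end position. The sketch uses the first four months of the temperature data; `df_small`, `by_label`, and `by_pos` are illustrative names.

```python
import pandas as pd

df_small = pd.DataFrame({'Min Temp (C)': [20, 26, 24, 25],
                         'Max Temp (C)': [35, 37, 37, 36]},
                        index=['Jan', 'Feb', 'Mar', 'Apr'])

by_label = df_small.loc['Jan':'Mar']  # label slice: 'Mar' is included
by_pos = df_small.iloc[0:2]           # position slice: position 2 is excluded

print(by_label)
print(by_pos)
```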

#### Boolean subsetting

In [None]:
# reload the gap data
gap = pd.read_csv('gapminder.tsv', delimiter='\t')


In [None]:
# let's find the mean of the 'lifeExp' column
leMean = gap['lifeExp'].mean()
leMean

In [None]:
gap.loc[gap['lifeExp']>leMean, :].tail(10)

In [None]:
gap.loc[(gap.continent == 'Africa'), ['country', 'lifeExp', 'gdpPercap']].head()

In [None]:
gap.loc[(gap.continent == 'Africa'), :].head()

In [None]:
Naija = gap.loc[(gap.country == 'Nigeria'), :]
Naija
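Conditions can also be combined. The sketch below uses a small made-up DataFrame named `toy` with Gapminder-style column names (the values are placeholders, not real data). Each condition goes in its own parentheses and they are joined with `&` for "and" or `|` for "or".

```python
import pandas as pd

# Placeholder values for illustration only, not real Gapminder figures
toy = pd.DataFrame({'country': ['A', 'A', 'B', 'B'],
                    'continent': ['Africa', 'Africa', 'Asia', 'Asia'],
                    'year': [2002, 2007, 2002, 2007],
                    'lifeExp': [50.0, 52.0, 60.0, 62.0]})

# Rows must satisfy both conditions to be kept
mask = (toy['continent'] == 'Africa') & (toy['year'] == 2007)
subset = toy.loc[mask, ['country', 'lifeExp']]

print(subset)
```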

### Groupby or 'Split', 'Apply', and 'Combine'

In [None]:
grpGap_country = gap.groupby('country')
grpGap_country

In [None]:
grpGap_country.get_group('Nigeria')

In [None]:
gap.groupby('year')['lifeExp'].mean().reset_index()
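To make the 'split', 'apply', and 'combine' steps visible, the sketch below performs them by hand on a small made-up DataFrame (`toy`, with placeholder values) and compares the result with the built-in one-liner. The names `pieces`, `manual`, and `built_in` are illustrative.

```python
import pandas as pd

# Placeholder values for illustration only
toy = pd.DataFrame({'year': [2002, 2002, 2007, 2007],
                    'lifeExp': [50.0, 60.0, 52.0, 62.0]})

pieces = {}
for year, group in toy.groupby('year'):     # split: one sub-DataFrame per year
    pieces[year] = group['lifeExp'].mean()  # apply: compute the mean of each piece
manual = pd.Series(pieces)                  # combine: gather results into a Series

built_in = toy.groupby('year')['lifeExp'].mean()

print(manual)
print(built_in)
```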

In [None]:
aveLE = gap.groupby('country')['lifeExp'].mean().reset_index()
aveLE.head()

We use 'reset_index' to move the group labels out of the index and back into an ordinary column. This flattens the result and also converts a pandas Series into a pandas DataFrame.
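The effect of 'reset_index' can be checked on a small made-up DataFrame (`toy`, with placeholder values). In this sketch, `grouped` is a Series indexed by country and `flat` is the flattened DataFrame; both names are illustrative.

```python
import pandas as pd

# Placeholder values for illustration only
toy = pd.DataFrame({'country': ['A', 'A', 'B', 'B'],
                    'lifeExp': [50.0, 52.0, 60.0, 62.0]})

grouped = toy.groupby('country')['lifeExp'].mean()  # Series indexed by country
flat = grouped.reset_index()                        # DataFrame with two columns

print(type(grouped))
print(flat)
```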

In [None]:
avePerCoun = gap.groupby('country')[['lifeExp', 'gdpPercap']].mean()

In [None]:
avePerCoun.reset_index().head()

In [None]:
gap.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean().reset_index().head()

In [None]:
# the 'nunique' method counts the number of unique elements in each group
gap.groupby('continent')['country'].nunique().reset_index()
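To see how 'nunique' differs from simply counting rows, the sketch below compares it with 'size' on a small made-up DataFrame (`toy`, with placeholder values). Country 'A' appears twice but is counted once by 'nunique'.

```python
import pandas as pd

# Placeholder values for illustration only
toy = pd.DataFrame({'country': ['A', 'A', 'B', 'C'],
                    'continent': ['Africa', 'Africa', 'Asia', 'Africa']})

unique_counts = toy.groupby('continent')['country'].nunique()  # distinct countries
row_counts = toy.groupby('continent').size()                   # all rows

print(unique_counts)
print(row_counts)
```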

### Let's plot some stuff
##### 'Groupby Year & Life Expectancy' or 'gyle'


In [None]:
# let's plot a line chart
import matplotlib.pyplot as plt
%matplotlib inline
gyle = gap.groupby('year')['lifeExp'].mean()
gyle.plot()

In [None]:
type(gyle)

In [None]:
# convert the Series back to a DataFrame
gyLE = gyle.reset_index()

gyLE

In [None]:
gyLE.to_csv('LifeExp_by_year.csv', index=False)
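As a quick check of what 'to_csv' writes, the sketch below saves a small made-up DataFrame (`frame`, with placeholder values) to an in-memory buffer instead of a file and reads it back. With `index=False` the row index is left out, so only the two columns are written.

```python
import io
import pandas as pd

# Placeholder values for illustration only
frame = pd.DataFrame({'year': [2002, 2007], 'lifeExp': [55.0, 57.0]})

buffer = io.StringIO()             # stands in for a file on disk
frame.to_csv(buffer, index=False)  # write without the row index
csv_text = buffer.getvalue()

back = pd.read_csv(io.StringIO(csv_text))  # read the text back into a DataFrame

print(csv_text)
print(back)
```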