$\LARGE\text{Data analysis with numpy and pandas - part 2}$

$\small\text{Ralph Tambala}$

# Pandas

Pandas is very useful when working with tabular or structured data (like an R data frame, a SQL table, an Excel spreadsheet, ...).

## Basic Concepts

**What is pandas?**

Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
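To make the comparison with NumPy concrete, here is a minimal sketch. The labels `row_a`, `row_b`, `x`, `y` and `label` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: values are addressed by position only
arr = np.array([[1, 2], [3, 4]])

# The same numbers as a DataFrame, with (made-up) row and column labels
frame = pd.DataFrame(arr, index=['row_a', 'row_b'], columns=['x', 'y'])

# Columns of different types can sit side by side
frame['label'] = ['first', 'second']

print(frame.loc['row_b', 'x'])  # select by label instead of by position
print(frame.dtypes)             # each column keeps its own dtype
```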

**Why pandas?**

1. Import data
2. Clean up messy data
3. Explore data, gain insight into data
4. Process and prepare your data for analysis
5. Analyse your data (together with scikit-learn, statsmodels, ...)

**Data analysis with pandas**

- Fast, easy and flexible input/output for a lot of different data formats
- Working with missing data (<code>.dropna()</code>, <code>pd.isnull()</code>)
- Merging and joining (<code>concat</code>, <code>join</code>)
- Grouping: <code>groupby</code> functionality
- Reshaping (<code>stack</code>, <code>pivot</code>)
- Powerful time series manipulation (resampling, timezones, ...)
- Easy plotting
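As a rough illustration of two items from the list above (combining tables and grouping), the following sketch uses two tiny made-up tables. The column names `city` and `sales` are assumptions chosen only for this example.

```python
import pandas as pd

# Two small made-up tables with the same columns
first = pd.DataFrame({'city': ['A', 'B'], 'sales': [10, 20]})
second = pd.DataFrame({'city': ['A', 'B'], 'sales': [5, 15]})

# Combining: stack the two tables on top of each other
combined = pd.concat([first, second], ignore_index=True)

# Grouping: total sales per city
totals = combined.groupby('city')['sales'].sum()
print(totals)
```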

## Import <code>pandas</code> library

Just as we generally import NumPy under the alias <code>np</code>, we will import Pandas under the alias <code>pd</code>:

In [1]:
import pandas as pd

## <code>pandas</code> data structures

The primary data structures in pandas are implemented as two classes:
1. <code>Series</code>, which is a single column.
2. <code>DataFrame</code>, which is a relational data table with rows and named columns. A <code>DataFrame</code> contains one or more <code>Series</code> and a name for each <code>Series</code>.

In [2]:
cities = pd.Series(['Blantyre', 'Lilongwe', 'Mzuzu', 'Zomba'])
cities

0    Blantyre
1    Lilongwe
2       Mzuzu
3       Zomba
dtype: object

In [3]:
type(cities)

pandas.core.series.Series

In [None]:
cities.size

In [None]:
# another series with corresponding city population in 2018
popn = pd.Series([800264, 989318, 221272, 105013])

In [None]:
city_popn = pd.DataFrame({'City': cities, 'Population': popn})
city_popn
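The relationship between the two structures can be checked directly. The sketch below rebuilds the same table and shows that selecting one column of a <code>DataFrame</code> gives back a <code>Series</code>. The variable name `population` is introduced here only for illustration.

```python
import pandas as pd

cities = pd.Series(['Blantyre', 'Lilongwe', 'Mzuzu', 'Zomba'])
popn = pd.Series([800264, 989318, 221272, 105013])
city_popn = pd.DataFrame({'City': cities, 'Population': popn})

# Selecting a single column returns a Series
population = city_popn['Population']
print(type(population))

# (rows, columns) of the table
print(city_popn.shape)

# Arithmetic on a Series is applied element by element
print(population / 1000)
```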

# Importing Data

There are several ways one can import data using <code>pandas</code>. In this tutorial, the dataset is a CSV file, and the function we are going to use to read in the file is called <code>pd.read_csv()</code>. This function returns a <code>DataFrame</code>.

In [None]:
import pandas as pd 
df = pd.read_csv('super_league_2020_day26_less.csv')

Now that we have our dataframe in our variable <code>df</code>, let's look at what it contains.

In [None]:
df
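To experiment with <code>pd.read_csv()</code> without the file at hand, one option is to read from an in-memory string. The sketch below uses made-up rows. The column names follow the ones used in this tutorial, but the values are invented and the real file may differ.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """Sno,Team Name,W,D,L,GF,GA
1,Team A,3,1,0,9,2
2,Team B,1,1,2,4,6
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample.columns.to_list())
```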

We can use the method <code>head()</code> to see the first few rows of the dataframe (five by default), or the method <code>tail()</code> to see the last few rows.

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df[:2] # we can also slice rows as with numpy or lists
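Both <code>head()</code> and <code>tail()</code> accept the number of rows to show, and <code>iloc</code> slices rows by position. A small sketch on a made-up ten-row table (the column name `n` is an assumption):

```python
import pandas as pd

# Made-up table with ten rows
numbers = pd.DataFrame({'n': range(10)})

print(numbers.head(3))    # first 3 rows
print(numbers.tail(2))    # last 2 rows
print(numbers.iloc[2:5])  # rows at positions 2, 3 and 4
```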

*Sno* is a serial number assigned to each team. Say we want to drop the column *Sno* since it does not add any important detail to the dataset. We use <code>drop()</code>.

In [None]:
df.drop('Sno', axis=1, inplace=True)
df.head()
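An alternative to <code>inplace=True</code> is to keep the original table and assign the result to a new name. The sketch below uses a made-up table, and the name `trimmed` is only illustrative.

```python
import pandas as pd

# Made-up table with a serial-number column
sample = pd.DataFrame({
    'Sno': [1, 2],
    'Team Name': ['Team A', 'Team B'],
    'W': [3, 1],
})

# drop() returns a new DataFrame; `sample` itself is left unchanged
trimmed = sample.drop(columns=['Sno'])
print(trimmed.columns.to_list())
```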

Next, let's add more useful columns to the dataset, for instance the total number of games played by each team.

In [None]:
df['W'] + df['D'] + df['L'] # number of games played by each team => wins + draws + losses

In [None]:
df['P'] = df['W'] + df['D'] + df['L']
df.head()

Just for aesthetics, we are going to rearrange the columns.

In [None]:
df.columns.to_list() # to print column names as a list

In [None]:
df = df[['Team Name', 'P', 'W', 'D', 'L', 'GF', 'GA']].copy() # copy so that new columns can be added safely
df.head()

In [None]:
df['GD'] = df['GF'] - df['GA']
df['PTS'] = df['W'] * 3 + df['D']

In [None]:
df.head()
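The derived columns can be checked by hand on a small made-up table. This sketch uses <code>assign()</code>, which returns a new dataframe with the extra columns. The rows are invented, and the formulas are the ones used above (3 points per win, 1 per draw).

```python
import pandas as pd

# Made-up results for two teams
sample = pd.DataFrame({
    'Team Name': ['Team A', 'Team B'],
    'W': [3, 1], 'D': [1, 1], 'L': [0, 2],
    'GF': [9, 4], 'GA': [2, 6],
})

sample = sample.assign(
    P=sample['W'] + sample['D'] + sample['L'],  # games played
    GD=sample['GF'] - sample['GA'],             # goal difference
    PTS=sample['W'] * 3 + sample['D'],          # points
)
print(sample)
```

For Team A this gives 3 + 1 + 0 = 4 games played, a goal difference of 9 - 2 = 7, and 3 × 3 + 1 = 10 points.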

# Descriptive analysis and exploration

In [None]:
df.describe()

Okay, so now let's look at information that we want to extract from the dataframe. Let's say we want to know the maximum value of each column. The method <code>max()</code> returns the maximum value of every column.

In [None]:
df.max()

Say we would like to know the mean of the games played.

In [None]:
df['P'].mean()

Then, if you'd like to get the maximum value of one particular column, select that column with the bracket indexing operator and call <code>max()</code> on it. In this case, we are looking for the highest points total, which belongs to the team best placed to win the league.

In [None]:
df['PTS'].max()
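<code>max()</code> returns only the value. To see which team holds it, one option is <code>idxmax()</code>, which returns the index label of the maximum and can be passed to <code>loc</code>. A sketch on made-up data:

```python
import pandas as pd

# Made-up points table
sample = pd.DataFrame({
    'Team Name': ['Team A', 'Team B', 'Team C'],
    'PTS': [10, 14, 7],
})

# Index label of the row with the highest points total
top_label = sample['PTS'].idxmax()

# Full row for that label
top_row = sample.loc[top_label]
print(top_row['Team Name'], top_row['PTS'])
```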

All teams with at least 12 wins.

In [None]:
df[df['W'] >= 12]

All teams that have scored at least 30 goals.

In [None]:
df[df['GF'] >= 30]

Teams with more than 12 wins and fewer than 28 games played.

In [None]:
df[(df['W'] > 12) & (df['P'] < 28)]
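Filters like these work in two steps: the comparison builds a boolean <code>Series</code> (a mask), and indexing with that mask keeps the rows where it is <code>True</code>. Each condition needs its own parentheses because <code>&</code> binds more tightly than the comparison operators. The sketch below uses made-up data and also shows <code>query()</code> as an equivalent spelling.

```python
import pandas as pd

# Made-up table
sample = pd.DataFrame({
    'Team Name': ['Team A', 'Team B', 'Team C'],
    'W': [14, 13, 9],
    'P': [26, 28, 27],
})

# Step 1: build the boolean mask, one value per row
mask = (sample['W'] > 12) & (sample['P'] < 28)
print(mask.to_list())

# Step 2: keep only the rows where the mask is True
selected = sample[mask]
print(selected)

# The same filter written as a query string
same = sample.query('W > 12 and P < 28')
```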

In [None]:
df['GF'].sum() # total number of goals scored

In [None]:
df.to_numpy()

In [None]:
df.T

In [None]:
df.sort_values(by='PTS', ascending=False, inplace=True)
df[:3]
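<code>sort_values()</code> also accepts a list of columns, which is handy when teams are level on points. The sketch below breaks ties by goal difference. That tie-break is an assumption made for illustration, not necessarily the league's actual rule, and the data is made up.

```python
import pandas as pd

# Made-up table where two teams are level on points
sample = pd.DataFrame({
    'Team Name': ['Team A', 'Team B', 'Team C'],
    'PTS': [10, 14, 10],
    'GD': [2, 5, 4],
})

# Sort by points first, then by goal difference, both descending
table = sample.sort_values(by=['PTS', 'GD'], ascending=False)

# Renumber the rows 0, 1, 2, ... to match the new order
table = table.reset_index(drop=True)
print(table)
```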

In [None]:
df.to_csv('super_league_detailed.csv')
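By default <code>to_csv()</code> also writes the row index as an extra first column. If that column is not wanted, <code>index=False</code> leaves it out. The sketch below writes made-up data to an in-memory buffer instead of a file, so nothing is created on disk.

```python
import io
import pandas as pd

# Made-up table
sample = pd.DataFrame({
    'Team Name': ['Team A', 'Team B'],
    'PTS': [10, 14],
})

# Write without the index column, into a text buffer
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read the text back to check the round trip
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```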

# Missing Data

<code>pandas</code> primarily uses <code>np.nan</code> to represent missing data.

Let's load another dataset to look at how one can deal with missing data. We also import <code>numpy</code> so that <code>np.nan</code> can be referred to directly when needed.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('super_league_2020_day26_missing.csv')
df

In [None]:
# To get the boolean mask where values are NaN 
pd.isna(df)

In [None]:
df.isnull() # isnull() is an alias of isna() and returns the same boolean mask
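Since <code>True</code> counts as 1 when summed, the boolean mask can be turned into a count of missing values per column. A sketch on made-up data with two gaps:

```python
import numpy as np
import pandas as pd

# Made-up table with two missing values
sample = pd.DataFrame({
    'Team Name': ['Team A', 'Team B', 'Team C'],
    'W': [3, np.nan, 1],
    'D': [1, 0, np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Total number of missing values in the table
total_missing = int(missing_per_column.sum())
print(total_missing)
```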

The <code>dropna</code> function allows you to drop all (or some) of the rows that have missing values.

In [None]:
df.dropna() # let's drop any rows with NaN
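Dropping "some" rows is controlled by the parameters of <code>dropna()</code>. The sketch below, on made-up data, compares the default with <code>subset</code> (only look at certain columns) and <code>axis=1</code> (drop columns instead of rows).

```python
import numpy as np
import pandas as pd

# Made-up table with two missing values
sample = pd.DataFrame({
    'Team Name': ['Team A', 'Team B', 'Team C'],
    'W': [3, np.nan, 1],
    'D': [1, 0, np.nan],
})

# Default: drop every row that has at least one missing value
complete_rows = sample.dropna()

# Only drop rows where 'W' is missing
has_wins = sample.dropna(subset=['W'])

# Drop columns (not rows) that contain missing values
complete_columns = sample.dropna(axis=1)
```

None of these calls changes `sample` itself; each returns a new dataframe.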

The <code>fillna</code> function replaces all missing values with a specific value.

In [None]:
df.fillna(value=10) # We are replacing all missing values with a 10

In [None]:
df.fillna(df.mean(numeric_only=True)) # or we can use the mean of each numeric column to replace its missing data
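<code>fillna()</code> also accepts a dictionary, so that each column gets its own replacement value. In this sketch on made-up data, missing wins become 0 and missing draws become the column mean. Those choices are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up table with two missing values
sample = pd.DataFrame({
    'Team Name': ['Team A', 'Team B', 'Team C'],
    'W': [3, np.nan, 1],
    'D': [1, 0, np.nan],
})

# One replacement value per column: 0 for 'W', the column mean for 'D'
filled = sample.fillna({'W': 0, 'D': sample['D'].mean()})
print(filled)
```

The mean of `D` ignores the missing entry, so it is (1 + 0) / 2 = 0.5.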

# Practice Work

1. Import the pandas package under the name pd.

2. Write code to import _some_data.csv_ (the CSV file has been provided) using pandas.

3. Let's look at the first 3 rows of the dataset.

4. Let's print the shape of the dataset.

5. Find the mean of the ages in the dataset.

6. Let's count the unique cities in the dataset. _Hint: Select the column in question and apply <code>unique()</code>, or <code>nunique()</code> to get the count directly._

7. Below I have provided the code to group the data by gender. Let's use similar code to group the data by city.

In [None]:
groups = df.groupby('gender')

groups.count()

In [None]:
# code here

8. Find all the rows in our dataset where age is more than 15 years.

9. Find all the rows in our dataset where city is Lilongwe.

10. Find all the rows in our dataset where city is Lilongwe and age is 15 years or above.