# Introduction to Pandas: A Short Tutorial

* Pandas is an open-source, BSD-licensed library
* It provides high-performance, easy-to-use data structures and data analysis tools
* It is built on top of NumPy and provides an efficient implementation of a DataFrame
* It makes data analysis fast and easy in Python

**DataFrame**: A multidimensional array with attached row and column labels

### **Important Information**

(1) To answer these exercises, you **must first read Chapter 3: Data Manipulation with Pandas from the Python Data Science Handbook** (https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html)


**Pandas API Reference**: https://pandas.pydata.org/pandas-docs/stable/reference/index.html

#### **This is an introduction to Pandas. Try to complete the tutorial yourself first, and check the model answers for help when needed.**

In [1]:
# Import Pandas
import numpy as np
import pandas as pd

In [2]:
# Create two lists with information from Baby names in England and Wales: 2018
# https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/
names  = ['Adam', 'Sophie', 'Charlie', 'Anna', 'Bobby', 'Florence', 'George', 'Mia']
births = [1508,   1929,      3336,     409,     652,    1974,        4949,    2418]

# Merge these two lists together using the zip function
# https://docs.python.org/3.3/library/functions.html
babiesDataSet = list(zip(names, births))

In [3]:
# Print the combined list
babiesDataSet

[('Adam', 1508),
 ('Sophie', 1929),
 ('Charlie', 3336),
 ('Anna', 409),
 ('Bobby', 652),
 ('Florence', 1974),
 ('George', 4949),
 ('Mia', 2418)]

In [4]:
# Use Pandas to create a dataframe
df = pd.DataFrame(data=babiesDataSet, columns=['Name', "Births"])

In [5]:
# Display the dataframe
df

Unnamed: 0,Name,Births
0,Adam,1508
1,Sophie,1929
2,Charlie,3336
3,Anna,409
4,Bobby,652
5,Florence,1974
6,George,4949
7,Mia,2418


In [6]:
# Export the dataframe to csv
# You can find the CSV file in the same directory with this Jupyter Notebook
# (If you are using Google Colaboratory this should be in the virtual directory)
df.to_csv("birthsUK2018.csv", index=False, header=False)

In [7]:
# Import data to dataframe
file = "birthsUK2018.csv"  # Path is relative to the notebook's working directory
births = pd.read_csv(file, header=None, names=['Name', 'Births'])

In [8]:
# Show the dataframe
# The numbers 0 to 7 in the first column are the index of the dataframe, not a data column.
births

Unnamed: 0,Name,Births
0,Adam,1508
1,Sophie,1929
2,Charlie,3336
3,Anna,409
4,Bobby,652
5,Florence,1974
6,George,4949
7,Mia,2418


In [9]:
# Check the data types of columns
births.dtypes

Name      object
Births     int64
dtype: object

In [10]:
# Get general info about the dataframe
# - There are 8 records in the data set
# - There is a column named "Name" of type object (non numeric) with 8 values
# - There is a column named "Births" of type numeric with 8 values
births.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    8 non-null      object
 1   Births  8 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 160.0+ bytes


In [11]:
# Check the data type of the Births column
births.Births.dtype

dtype('int64')

In [12]:
# Print the top 5 rows
# Write your code here
births[:5]

Unnamed: 0,Name,Births
0,Adam,1508
1,Sophie,1929
2,Charlie,3336
3,Anna,409
4,Bobby,652


In [13]:
# Print the last 3 rows
# Write your code here
births[-3:]

Unnamed: 0,Name,Births
5,Florence,1974
6,George,4949
7,Mia,2418
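
The slices used in the two cells above work, but pandas also offers `head()` and `tail()` for the same purpose. The sketch below is a minimal illustration that rebuilds the two-column dataframe inline (an assumption made so that the block runs on its own); the names `first_five` and `last_three` are chosen here for illustration only.

```python
import pandas as pd

# Rebuild the tutorial's two-column dataframe so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
})

first_five = births.head()    # head() returns the first 5 rows by default
last_three = births.tail(3)   # tail(n) returns the last n rows

print(first_five)
print(last_three)
```

Both calls return new dataframes and keep the original index labels, so they should match `births[:5]` and `births[-3:]` here.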


In [14]:
# Print the name of columns
# Write your code here
births.columns

Index(['Name', 'Births'], dtype='object')

In [15]:
# Transform the dataframe into an array
# Write your code here
births.values

array([['Adam', 1508],
       ['Sophie', 1929],
       ['Charlie', 3336],
       ['Anna', 409],
       ['Bobby', 652],
       ['Florence', 1974],
       ['George', 4949],
       ['Mia', 2418]], dtype=object)

In [16]:
# Get the index of the dataframe
# Write your code here
births.index

RangeIndex(start=0, stop=8, step=1)

In [17]:
# Access the entries of the Name column (using dataframe slicing)
# Write your code here
births['Name']

0        Adam
1      Sophie
2     Charlie
3        Anna
4       Bobby
5    Florence
6      George
7         Mia
Name: Name, dtype: object

In [19]:
# Access the entries of the Births column as a property
# Write your code here
births.Births

0    1508
1    1929
2    3336
3     409
4     652
5    1974
6    4949
7    2418
Name: Births, dtype: int64

In [20]:
# Find the maximum number of births
# Write your code here
births['Births'].max()

4949

In [22]:
# Get the name associated with the max births
# Write your code here
births.loc[births['Births'].idxmax(), 'Name']

'George'

In [23]:
# Find the unique names
# Write your code here
births['Name'].unique()

array(['Adam', 'Sophie', 'Charlie', 'Anna', 'Bobby', 'Florence', 'George',
       'Mia'], dtype=object)

In [25]:
# Get some descriptive statistics for the number of births
# Write your code here
print("MEAN BIRTHS")
print(births['Births'].mean())
print("MEDIAN BIRTHS")
print(births['Births'].median())
print("MIN BIRTHS")
print(births['Births'].min())
print("MAX BIRTHS")
print(births['Births'].max())

MEAN BIRTHS
2146.875
MEDIAN BIRTHS
1951.5
MIN BIRTHS
409
MAX BIRTHS
4949
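
As an alternative to calling each statistic separately, `describe()` can return several of them at once. The following is a minimal sketch, assuming the same birth counts as above rebuilt inline; the variable name `summary` is illustrative.

```python
import pandas as pd

# Rebuild the Births column so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
})

# describe() returns a Series holding count, mean, std, min, quartiles and max
summary = births["Births"].describe()
print(summary)

# Individual statistics can be read by label; "50%" is the median
print(summary["mean"], summary["50%"], summary["min"], summary["max"])
```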


In [28]:
# Get the names with births more than 2000
# Write your code here
print(births['Name'][births['Births']>2000])

2    Charlie
6     George
7        Mia
Name: Name, dtype: object


In [33]:
# Get the names starting with "A"
# Write your code here
print(births['Name'][births['Name'].str.startswith("A")])

0    Adam
3    Anna
Name: Name, dtype: object


In [35]:
# Add another column with the country set for all rows as UK
# Tip: Use numpy function repeat (if needed)
# Write your code here
import numpy as np
births['Country'] = np.repeat("UK", len(births))
births

Unnamed: 0,Name,Births,Country
0,Adam,1508,UK
1,Sophie,1929,UK
2,Charlie,3336,UK
3,Anna,409,UK
4,Bobby,652,UK
5,Florence,1974,UK
6,George,4949,UK
7,Mia,2418,UK


In [39]:
# Add a column with the gender of the babies
# Assume the genders are alternating "Male", "Female"
# e.g., Adam -> Male, Sophie -> Female, Charlie -> Male, Anna -> Female etc
# Tip: Use the NumPy function tile if needed
# Write your code here
births['Gender'] = np.tile(("Male", "Female"), len(births) // 2)
births

Unnamed: 0,Name,Births,Country,Gender
0,Adam,1508,UK,Male
1,Sophie,1929,UK,Female
2,Charlie,3336,UK,Male
3,Anna,409,UK,Female
4,Bobby,652,UK,Male
5,Florence,1974,UK,Female
6,George,4949,UK,Male
7,Mia,2418,UK,Female


In [43]:
# Add a column that gives each name's percentage of the total births
# Write your code here
birthSum = births['Births'].sum()
births['birthPercentage'] = (births['Births'] / birthSum) * 100
births

Unnamed: 0,Name,Births,Country,Gender,birthPercentage
0,Adam,1508,UK,Male,8.780204
1,Sophie,1929,UK,Female,11.231441
2,Charlie,3336,UK,Male,19.423581
3,Anna,409,UK,Female,2.381368
4,Bobby,652,UK,Male,3.796215
5,Florence,1974,UK,Female,11.49345
6,George,4949,UK,Male,28.815138
7,Mia,2418,UK,Female,14.078603


In [47]:
# Delete the country column
# Write your code here
births.drop(["Country"], axis=1, inplace=True)
births

Unnamed: 0,Name,Births,Gender,birthPercentage
0,Adam,1508,Male,8.780204
1,Sophie,1929,Female,11.231441
2,Charlie,3336,Male,19.423581
3,Anna,409,Female,2.381368
4,Bobby,652,Male,3.796215
5,Florence,1974,Female,11.49345
6,George,4949,Male,28.815138
7,Mia,2418,Female,14.078603


In [49]:
# Create a new dataframe with only the Name, Births, and birthPercentage columns, then print it
# Note: the percentage column is named 'birthPercentage'; selecting 'Percentage' raises a KeyError
# Write your code here
newBirths = births[['Name', 'Births', 'birthPercentage']]
newBirths

In [None]:
# Subset the data based on index location to get the first 3 records
# Tip: Use the iloc property of the dataframe
# Write your code here
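
One possible answer to the exercise above, offered as a sketch rather than the model answer: `iloc` selects rows by integer position, so a slice of `:3` gives the first three records. The dataframe is rebuilt inline (an assumption so that the block is self-contained), and `first_three` is an illustrative name.

```python
import pandas as pd

# Rebuild the tutorial's dataframe so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
    "Gender": ["Male", "Female"] * 4,
})

# iloc is position-based: rows 0, 1 and 2 (the end of the slice is exclusive)
first_three = births.iloc[:3]
print(first_three)
```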


In [None]:
# Get the names that belong to female babies using the query function of a dataframe
# Tip: Inside query(), prefix a Python variable with '@' to reference it
# Write your code here
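
A possible approach to the exercise above, sketched under the assumption that the Gender column holds the strings "Male" and "Female" as created earlier. Inside a `query()` expression, a Python variable can be referenced with the `@` prefix; `target_gender` and `female_names` are illustrative names.

```python
import pandas as pd

# Rebuild the tutorial's dataframe so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
    "Gender": ["Male", "Female"] * 4,
})

target_gender = "Female"

# '@target_gender' refers to the Python variable defined above
female_names = births.query("Gender == @target_gender")["Name"]
print(female_names)
```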


In [None]:
# Get the names that belong to female babies by slicing 
# Write your code here
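
The same selection can be made without `query()` by building a boolean mask and passing it to `loc`. This is one possible sketch, with the dataframe rebuilt inline so that it runs on its own; `is_female` and `female_names` are illustrative names.

```python
import pandas as pd

# Rebuild the tutorial's dataframe so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
    "Gender": ["Male", "Female"] * 4,
})

# The comparison yields a boolean Series: True where the row is female
is_female = births["Gender"] == "Female"

# loc[rows, column] keeps the rows where the mask is True and returns the Name column
female_names = births.loc[is_female, "Name"]
print(female_names)
```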


In [None]:
# Get the names whose births are below 1000 and are male using the query function of a dataframe
# Write your code here


In [None]:
# Get the names whose births are below 1000 and are male by slicing
# Write your code here
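
The two exercises above combine two conditions. The sketch below shows one way to write each, assuming the dataframe built earlier (rebuilt inline here): `query()` accepts `and` inside its expression string, while boolean masks are combined with `&` and need parentheses around each comparison. The result names are illustrative.

```python
import pandas as pd

# Rebuild the tutorial's dataframe so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
    "Gender": ["Male", "Female"] * 4,
})

# Variant 1: query() with a compound expression
via_query = births.query("Births < 1000 and Gender == 'Male'")["Name"]

# Variant 2: boolean masks combined element-wise with &
mask = (births["Births"] < 1000) & (births["Gender"] == "Male")
via_mask = births.loc[mask, "Name"]

print(via_query)
print(via_mask)
```

Both variants should select the same rows; using Python's `and` instead of `&` on the masks raises a `ValueError`, because a Series has no single truth value.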


In [None]:
# Get the number of births grouped by gender
# Write your code here
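
A minimal sketch for the exercise above, reading "number of births" as the sum of the Births column within each gender (an interpretation, not something the exercise states explicitly). The dataframe is rebuilt inline and `births_by_gender` is an illustrative name.

```python
import pandas as pd

# Rebuild the tutorial's dataframe so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
    "Gender": ["Male", "Female"] * 4,
})

# Split the rows by Gender, then sum the Births column within each group
births_by_gender = births.groupby("Gender")["Births"].sum()
print(births_by_gender)
```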


In [None]:
# Sort the dataframe by name
# Write your code here


In [None]:
# Sort the dataframe by the number of births in descending order
# Write your code here
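
Both sorting exercises above can be approached with `sort_values()`. The sketch below assumes the dataframe built earlier (rebuilt inline here); `by_name` and `by_births` are illustrative names.

```python
import pandas as pd

# Rebuild the tutorial's dataframe so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
    "Gender": ["Male", "Female"] * 4,
})

# Alphabetical order by name (ascending is the default)
by_name = births.sort_values("Name")

# Largest number of births first
by_births = births.sort_values("Births", ascending=False)

print(by_name)
print(by_births)
```

`sort_values()` returns a new dataframe and leaves `births` unchanged; the original index labels travel with their rows.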


In [None]:
# Add a column with the county each child has been born in as given by the list below
county = ['Yorkshire', 'Essex', 'Yorkshire', 'Yorkshire', 'Kent', 'Kent', 'Yorkshire', 'Essex']
# Write your code here
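
One way to approach the exercise above is to assign the list directly to a new column; pandas matches list items to rows by position, so the list needs exactly one entry per row. This is a sketch with the dataframe rebuilt inline, using the column name `County` as an assumption.

```python
import pandas as pd

# Rebuild the tutorial's dataframe so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
    "Gender": ["Male", "Female"] * 4,
})

county = ['Yorkshire', 'Essex', 'Yorkshire', 'Yorkshire', 'Kent', 'Kent', 'Yorkshire', 'Essex']

# The list length must equal the number of rows, otherwise pandas raises a ValueError
births["County"] = county
print(births)
```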


In [None]:
# Group the data based on Gender and then by County
# Write your code here
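
A possible sketch for the exercise above: passing a list of column names to `groupby()` groups by the first key and then by the second. Summing Births per group is an assumption made here to produce a visible result, since the exercise does not say which aggregate to compute. The dataframe, including the County column, is rebuilt inline.

```python
import pandas as pd

# Rebuild the tutorial's dataframe, including the County column, so this sketch runs on its own
births = pd.DataFrame({
    "Name": ["Adam", "Sophie", "Charlie", "Anna", "Bobby", "Florence", "George", "Mia"],
    "Births": [1508, 1929, 3336, 409, 652, 1974, 4949, 2418],
    "Gender": ["Male", "Female"] * 4,
    "County": ["Yorkshire", "Essex", "Yorkshire", "Yorkshire", "Kent", "Kent", "Yorkshire", "Essex"],
})

# Group by Gender first, then by County; the result is indexed by (Gender, County) pairs
grouped = births.groupby(["Gender", "County"])["Births"].sum()
print(grouped)
```

Only combinations that occur in the data appear in the result, so there is no entry for male babies in Essex.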
