CM Hub: Data processing with Pandas
One-day data processing with Pandas course for the CM Hub at Imperial College
- Part 1: Pandas
- Part 2: awk
Part 1. Pandas
Pandas is a Python package.
0. Before we begin
- You need Python working on your system.
pip install pandas
pip install xlrd
pip install statsmodels
1. Why you might want to use Pandas
Pros of Pandas:
- Is part of the Python ecosystem, and Python is great
- Good for large data sets
- Powerful tools for dealing with broken data (etc.)
Cons of Pandas:
- You need to know Python
- No graphical interface
- You need a few extra packages
- Eats shoots and leaves
In short: the setup cost is often worthwhile, because it gives you powerful data manipulation features that you can then connect to other useful Python code.
2. Let's get started
- Join in. Open a Jupyter notebook by typing
jupyter notebook
in the terminal.
- Jupyter notebook specific: We're going to want to display our graphs inside the notebook, rather than externally. To do that, type
%matplotlib inline
- We're going to be using Python 3. If you've got Python 2, that's OK, just start by typing
from __future__ import print_function
Test it works!
import pandas as pd
import matplotlib.pyplot as plt
OK let's go
data = pd.read_csv("A1_mosquito_data.csv")
data
(fine within a Jupyter notebook)
Let's pick out things:
- Print the last 3 rows
- Print the first two rows, but only the year and rainfall columns
- Try to print a single row
print(data['temperature'])
(so you can select a single column this way)
data['temperature'][data['temperature'] > 75]
data['temperature'][data['year'] > 2005]
print(data.mean(1))
(the mean across each row, which is not very useful here)
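As a minimal sketch of the selections above, assuming a small made-up DataFrame in place of A1_mosquito_data.csv (the values are invented for illustration, and head, tail and iloc are only one of several ways to pick out rows):

```python
import pandas as pd

# Hypothetical stand-in for A1_mosquito_data.csv; values are invented.
data = pd.DataFrame({
    "year": [2001, 2002, 2003, 2004, 2005, 2006, 2007],
    "temperature": [80, 85, 86, 87, 74, 75, 80],        # degrees Fahrenheit
    "rainfall": [157, 252, 154, 159, 292, 283, 214],    # mm
    "mosquitos": [150, 217, 153, 158, 243, 237, 190],
})

last_three = data.tail(3)                        # last 3 rows
first_two = data[["year", "rainfall"]].head(2)   # first 2 rows, two columns only
row_1 = data.iloc[1]                             # a single row, by integer position

# Boolean masks: keep only the temperatures where the condition holds.
hot = data["temperature"][data["temperature"] > 75]
recent = data["temperature"][data["year"] > 2005]

print(last_three)
print(first_two)
print(row_1)
print(hot)
print(recent)
```

Note that data[1] would look for a column called 1, which is why a single row is picked out with iloc here.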
- Print the mean number of mosquitos in the years where the rainfall was more than 200mm.
- Now do the same for when the rainfall was less than 200mm.
- Print the standard deviation of the temperatures (you may have to Google!)
3. Loops and ifs
These work in the usual Python way:
for index, row in data.iterrows():
    temp_in_f = row['temperature']
    temp_in_c = (temp_in_f - 32) * 5 / 9.0
    print(temp_in_c)
- We can add in:
if temp_in_c > 23:
    print("Hotter than 23C!")
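Putting the loop and the if together might look like the sketch below. It assumes a tiny made-up DataFrame (invented values) rather than the course data, and the names hot_years_loop and hot_years_vectorised are hypothetical. The second half shows a column-wise alternative that avoids the explicit loop.

```python
import pandas as pd

# Hypothetical data; temperatures are in degrees Fahrenheit.
data = pd.DataFrame({
    "year": [2001, 2002, 2003],
    "temperature": [80, 85, 70],
})

# Row-by-row loop, with the if placed inside the loop body.
hot_years_loop = []
for index, row in data.iterrows():
    temp_in_c = (row["temperature"] - 32) * 5 / 9.0
    if temp_in_c > 23:
        print(int(row["year"]), "Hotter than 23C!")
        hot_years_loop.append(int(row["year"]))

# Column-wise alternative: convert the whole column, then filter with a mask.
temp_c = (data["temperature"] - 32) * 5 / 9.0
hot_years_vectorised = data.loc[temp_c > 23, "year"].tolist()
```

For large data sets the column-wise form is usually much faster than iterrows, but the loop is easier to extend with arbitrary Python code.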
- Import the data from data2, determine the mean temperature, and loop over the temperature values. For each value print out whether it is greater than the mean, less than the mean, or equal to the mean.
mosquitos_vs_year = data[['year','mosquitos']]
- Using the data2 dataset, plot the number of mosquitos against the rainfall. What needs fixing?
plt.title('Mosquitoes like water')
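A possible way to plot mosquitos against rainfall with a title and labelled axes is sketched below. It assumes a small made-up DataFrame (invented values), a scatter plot as the plot type, and the non-interactive Agg backend so that it runs outside a notebook. None of these choices come from the course material.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; rainfall in mm.
data = pd.DataFrame({
    "rainfall": [157, 252, 154, 159, 292, 283, 214],
    "mosquitos": [150, 217, 153, 158, 243, 237, 190],
})

# A scatter plot does not join the points, so the row order does not matter.
ax = data.plot(x="rainfall", y="mosquitos", kind="scatter")
ax.set_xlabel("Rainfall (mm)")
ax.set_ylabel("Number of mosquitoes")
ax.set_title("Mosquitoes like water")
```

In a Jupyter notebook the figure appears below the cell; in a script, plt.savefig or plt.show would be needed to see it.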
- Plot the number of mosquitoes against the temperature in Celsius.
Let's group the data by temperature and then plot the mean number of mosquitoes for each temperature (to the nearest degree).
mosquito_data_only = data[['temperature','mosquitos']]
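The grouping step could be sketched as follows, assuming made-up data (invented values) and using round() to get each temperature to the nearest degree. The name mean_by_temp is hypothetical.

```python
import pandas as pd

# Hypothetical data; temperatures in degrees Fahrenheit.
data = pd.DataFrame({
    "temperature": [80.2, 79.8, 85.1, 84.9, 74.0],
    "mosquitos": [150, 160, 200, 220, 100],
})

mosquito_data_only = data[["temperature", "mosquitos"]]

# Group rows by temperature rounded to the nearest degree,
# then take the mean mosquito count within each group.
rounded_temp = mosquito_data_only["temperature"].round()
mean_by_temp = mosquito_data_only.groupby(rounded_temp)["mosquitos"].mean()

print(mean_by_temp)
```

Calling mean_by_temp.plot() would then draw the mean count against the rounded temperature.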
- Do the same with the larger data file.
Binning sorts data into intervals, or 'bins'.
bins = [0,200,250,300]
labels = ['dry','normal','wet']
pd.cut(data['rainfall'],bins,labels=labels).value_counts().plot(kind='pie')
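To see what pd.cut does before plotting, here is a small sketch with the same bins and labels applied to made-up rainfall values (invented for illustration). By default each bin includes its right edge, so a rainfall of exactly 200 counts as 'dry'.

```python
import pandas as pd

# Hypothetical rainfall values in mm.
data = pd.DataFrame({"rainfall": [157, 252, 154, 159, 292, 283, 214, 200]})

bins = [0, 200, 250, 300]            # intervals (0, 200], (200, 250], (250, 300]
labels = ["dry", "normal", "wet"]

# Assign each rainfall value to a labelled bin, then count values per bin.
rain_category = pd.cut(data["rainfall"], bins, labels=labels)
counts = rain_category.value_counts()

print(rain_category)
print(counts)
```

Values outside the outermost bin edges (here, above 300 or at 0 and below) become NaN and are not counted.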
7. Adding columns
We can add columns:
data['temperature_celsius'] = (data['temperature']-32)*5/9.
We can sort data:
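One way to sort is DataFrame.sort_values, sketched here on made-up data (invented values). The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data.
data = pd.DataFrame({
    "year": [2001, 2002, 2003],
    "rainfall": [157, 252, 154],
    "mosquitos": [150, 217, 153],
})

# Ascending by rainfall (the default order).
by_rainfall = data.sort_values("rainfall")

# Descending by mosquito count.
by_mosquitos_desc = data.sort_values("mosquitos", ascending=False)

print(by_rainfall)
print(by_mosquitos_desc)
```

sort_values returns a new DataFrame and leaves data itself unchanged.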
And the scatter matrix:
from pandas.plotting import scatter_matrix
scatter_matrix(data[['temperature','rainfall','mosquitos']])
import statsmodels.api as sm
x = data["rainfall"]
y = data["mosquitos"]
X = sm.add_constant(x)       # add y-intercept
model = sm.OLS(y, X).fit()   # Ordinary Least Squares
predictions = model.predict(X)
model.summary()
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         49.8413      3.223     15.465      0.000      42.409      57.273
rainfall       0.6592      0.015     44.965      0.000       0.625       0.693
From the coef column we see that mosquitos = 0.6592 * rainfall + 49.8413.
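The fitted line can be used to estimate a mosquito count for a given rainfall. The sketch below simply plugs the two coefficients from the summary into the equation; the function name predict_mosquitos is hypothetical, and rainfall is assumed to be in mm.

```python
# Coefficients copied from the regression summary.
INTERCEPT = 49.8413   # 'const' row
SLOPE = 0.6592        # 'rainfall' row


def predict_mosquitos(rainfall_mm):
    """Estimate the mosquito count from rainfall using the fitted line."""
    return SLOPE * rainfall_mm + INTERCEPT


estimate_at_200 = predict_mosquitos(200)
print(estimate_at_200)
```

With the fitted model object itself, model.predict does the same job for a whole column at once, as in the code above.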
11. Miscellaneous things
We can ignore missing or NaN values:
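As a rough illustration, assuming a made-up DataFrame with a few gaps (invented values): pandas skips NaN values in statistics such as mean by default, and dropna removes incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries.
data = pd.DataFrame({
    "temperature": [80.0, np.nan, 74.0],
    "mosquitos": [150, 217, np.nan],
})

# NaN values are skipped by default, so this averages 80.0 and 74.0.
mean_temp = data["temperature"].mean()

# Opting out of skipping gives NaN as soon as one value is missing.
mean_with_nan = data["temperature"].mean(skipna=False)

# Keep only the rows with no missing values at all.
complete_rows = data.dropna()

# Count the missing values in each column.
n_missing = data.isna().sum()

print(mean_temp, mean_with_nan)
print(complete_rows)
print(n_missing)
```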
- Import from Excel
You're the director of undergraduate studies. You have to do a few things.
- You need to give prizes to the five students taking Physics with the top mean marks over all four modules. Which students get the prizes?
- The staff member running the tutor group with the highest mean mark gets a beer. Which group's tutor gets the beer?
- You need to report the mean mark for each course to the faculty. List the four courses by order of mean mark. Plot these on a bar chart so they can understand it.
- Scores of 70% and above are a 'first'. Scores between 60 and 69% are an 'upper second', between 50 and 59% a 'lower second', between 40 and 49% a 'third', and 39% and below is a fail. For Quantum Mechanics, plot a pie chart showing the number of students who fall in each of these categories.
- Students on the Physics programme pass the year if they score more than 40% on three out of four modules. Otherwise they fail. How many students failed? Loop through the failing students, printing out a personalised statement (imagine that you will code it so Python emails it to them) telling them they've failed.
- Rumour has it the scores for Lab Work have been made up. Create a scatter matrix for the four courses. What does this tell you?
- Do a linear regression analysis to come up with a linear model for the Waves score based on the Relativity score.
Mosquitos example adapted from Software Carpentry. CC BY 4.0.