Data Analysis Tutorial

More datasets

Simpler Datasets
A huge list of datasets
538's datasets

Overview

The goal of this tutorial is to cover the essential first steps of a data analysis.

The typical analysis pipeline goes through the following stages:

1. Think about the data you would like.
2. Either find a way to collect that data, or find data that already exists.

   - Sometimes you have to compromise on data because it is easier to use what already exists.
   - I have provided links to datasets above.
   - For this tutorial, we use a Titanic dataset.

3. Write code that takes the data from a file or database and loads it into a data structure.

   - We will be using Pandas, a data management library.
   - Pandas makes manipulating data really easy.

4. Write code that puts the data into different forms that match the task you want to do.

   - For instance, if you want to view interesting properties of your data as a scatter plot, you need two lists: one for the x positions and one for the y positions.
   - You should be thinking about what kinds of things the data can tell you.
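The load-and-reshape steps above can be sketched in a few lines. This is only an illustration, not the tutorial's actual data: the three-row CSV text and the names `sample_csv`, `sample_df`, `x_positions`, and `y_positions` are made up for the example, and only the column names mirror the Titanic file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a file or database: a tiny CSV held in memory.
sample_csv = """Survived,Pclass,Age
1,1,38.0
0,3,22.0
1,3,26.0
"""

# Load the raw text into a DataFrame (the "data structure" step).
sample_df = pd.read_csv(io.StringIO(sample_csv))

# Reshape for a scatter plot: one list of x positions, one of y positions.
x_positions = sample_df["Pclass"].tolist()
y_positions = sample_df["Age"].tolist()

print(x_positions)
print(y_positions)
```

The two lists line up row by row, which is the shape a scatter plot function expects for its x and y arguments.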

I will be writing this tutorial while looking at the Titanic dataset. The Titanic dataset is a list of passengers, information about them, and whether or not they survived.

Getting the Data

I have made the data easy to get: ::

    from urllib import request
    import pandas as pd

    filepath = 'https://gist.githubusercontent.com/braingineer/5d15057ac482ee0130b6d0e6f9cc9311/raw/d4eefaecc98b342ec578cf3512184556e8856750/titanic.csv'
    response = request.urlopen(filepath)
    df = pd.read_csv(response)
    df = df.fillna(0)

Using Pandas and Matplotlib

Some example tutorials

Simple Graphics
Beautiful Plots

Some simple operations

Selecting a column ::

    age_column = df['Age']

Selecting a subset ::

    df2 = df[age_column > 0]
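Because the loading step replaces missing values with 0, the `age_column > 0` filter also drops every row whose age was missing. The sketch below shows the mechanics on a made-up four-row frame; `toy`, `mask`, `toy_known_age`, and the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing ages.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 38.0, np.nan],
})

# Same cleaning step as the tutorial: missing values become 0.
toy = toy.fillna(0)

# A comparison on a column yields one True/False per row (a boolean mask).
mask = toy["Age"] > 0

# Indexing with the mask keeps only the rows where it is True.
toy_known_age = toy[mask]

print(mask.tolist())
print(toy_known_age["PassengerId"].tolist())
```

Note that a real age of 0 would be dropped as well, so this filter is only safe when 0 is never a legitimate value.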

View the columns ::

    print(df2.columns)

Visualize a scatter plot ::

    import matplotlib.pyplot as plt

    plt.scatter(df2['Survived'], df2['Age']);

    # or with the columns pulled out first
    surv_col = df2['Survived']
    age_col = df2['Age']
    plt.scatter(surv_col, age_col);

Seaborn

If you don't already have seaborn, install it by running the following in a single cell of your Jupyter Notebook: ::

    !pip install seaborn

Then, you can do the following: ::

    import seaborn as sns

    sns.barplot(data=df, x='Pclass', y='Survived')

You can see more examples of seaborn plots at the seaborn website.

Some examples to get you started: ::

    sns.countplot(data=df, x='Sex', hue='Survived')

    ### do these in different cells otherwise they will try to plot on top of each other
    # catplot is the current name for factorplot (renamed in seaborn 0.9),
    # and the category order is passed as order rather than x_order
    sns.catplot(data=df, x='Pclass', y='Age', col='Sex', kind='swarm', hue='Survived', order=[1, 2, 3])

Science

To use data for science, you want to summarize what happened. In other words, you want to tell a story with the data. To do this, you have to look at its different properties: counts, means, proportions, etc.

A good way to formulate a scientific question is to think about different groups. If the rate at which something happens differs between two groups, then the group may have an effect on the outcome.
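One way to check for such a difference is to compute the rate separately for each group. The sketch below does this with `groupby` on a small made-up frame; the six rows and the names `groups_df`, `rate_by_group`, and `difference` are invented for illustration. A gap between the rates only suggests an effect and does not establish one by itself.

```python
import pandas as pd

# Hypothetical outcomes for six passengers (1 = survived, 0 = did not).
groups_df = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1],
})

# Survived is coded 0/1, so its mean within a group is that group's survival rate.
rate_by_group = groups_df.groupby("Sex")["Survived"].mean()
print(rate_by_group)

# The gap between the two rates is the quantity the question is about.
difference = rate_by_group["female"] - rate_by_group["male"]
print(difference)
```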

Some terminology

Proportion: A proportion is a number between 0 and 1 that expresses a part-to-whole relationship.
- If you eat half of a cake, the proportion you ate is 0.5.
Percentage: A percentage is a number between 0 and 100 that expresses the same part-to-whole relationship.
- If you eat half of a cake, the percentage you ate is 50%.
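The two terms differ only by a factor of 100, as this short sketch shows. The eight 0/1 values are hypothetical, and the names `survived`, `proportion`, and `percentage` are chosen for the example.

```python
import pandas as pd

# Hypothetical 0/1 outcomes for eight passengers (1 = survived).
survived = pd.Series([1, 0, 0, 1, 0, 0, 0, 1])

# Proportion: the part divided by the whole, a number between 0 and 1.
proportion = survived.sum() / len(survived)

# Percentage: the same relationship scaled to the range 0 to 100.
percentage = proportion * 100

print(proportion)
print(percentage)
```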

Questions you can ask

How many people were on the Titanic?
What percentage of the passengers did not survive?
How many of the passengers were male? How many were female?
How many male passengers survived? How many female? Is there an interesting relationship?
What is the proportion of 3rd class passengers who survived?
Does the relationship between gender and survival change with passenger class?
What is the mean age per class?
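Several of these questions map onto one-line Pandas operations. The sketch below runs them on a hypothetical six-row frame (`mini`) that reuses the column names `Sex`, `Pclass`, `Age`, and `Survived`; the values are invented, so the numbers say nothing about the real Titanic data.

```python
import pandas as pd

# Hypothetical mini passenger list; the real dataset has many more rows.
mini = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Pclass": [1, 1, 3, 3, 3, 2],
    "Age": [40.0, 30.0, 20.0, 24.0, 28.0, 35.0],
    "Survived": [0, 1, 0, 1, 1, 1],
})

# How many people are in the data?
total_passengers = len(mini)

# How many passengers of each sex?
count_by_sex = mini["Sex"].value_counts()

# How many of each sex survived? Rows are Sex, columns are Survived (0 or 1).
survival_table = pd.crosstab(mini["Sex"], mini["Survived"])

# What proportion of 3rd class passengers survived?
third_class = mini[mini["Pclass"] == 3]
third_class_survival = third_class["Survived"].mean()

# What is the mean age per class?
mean_age_by_class = mini.groupby("Pclass")["Age"].mean()

print(total_passengers)
print(count_by_sex)
print(survival_table)
print(third_class_survival)
print(mean_age_by_class)
```

Swapping `mini` for the `df` loaded earlier should answer the same questions for the full dataset, assuming its columns carry these names.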

Additional setup

A version I was working with that renames a column and cleans the dataset: ::

    from urllib import request
    import pandas as pd
    import seaborn as sns
    %matplotlib inline

    filepath = 'https://gist.githubusercontent.com/braingineer/5d15057ac482ee0130b6d0e6f9cc9311/raw/d4eefaecc98b342ec578cf3512184556e8856750/titanic.csv'
    response = request.urlopen(filepath)
    df = pd.read_csv(response)
    df = df.fillna(0)

    # rename the 'Pclass' column to 'Class'
    df = df.rename(columns={'Pclass': 'Class'})

    # missing ages were filled with 0 above, so this keeps only rows with a known age
    df_clean = df[df['Age'] > 0]

And a couple of extra plots I was looking at: ::

    # catplot is the current name for factorplot (renamed in seaborn 0.9)

    ### super fancy
    sns.catplot(data=df_clean, kind='violin', split=True, inner='stick', scale='count', x='Class', y='Age', hue='Survived', col='Sex')

    ### really sad
    sns.catplot(data=df_clean, kind='bar', col='Class', x='SibSp', y='Age', hue='Survived', row='Sex')
