The goal of this tutorial is to walk through the important parts of beginning data analysis.
The typical analysis pipeline goes through the following stages:
- Think about the data you would like to have.
- Either find a way to collect that data, or find data that already exists.

  - Sometimes you have to compromise on the data, because it is easier to use something that already exists.
  - I have provided links to datasets above.
  - For this tutorial, we will use the Titanic dataset.

- Write code that takes the data from a file or database and loads it into a data structure.

  - We will be using Pandas, a data management library.
  - Pandas makes manipulating data much easier.

- Write code that puts the data into the form that matches the task you want to do.

  - For instance, if you want to view interesting properties of your data as a scatter plot, you need two lists: one for the x positions and one for the y positions.

- You should be thinking about what kinds of things the data can tell you.
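The load and reshape steps above can be sketched on a tiny in-memory CSV. The column names ``height_cm`` and ``weight_kg`` and all of the values are made up for illustration; they are not part of any dataset used in this tutorial.

```python
import io

import pandas as pd

# Made-up rows for illustration only.
csv_text = """height_cm,weight_kg
150,52
165,61
180,77
"""

# Load: read_csv accepts a path, a URL, or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

# Reshape: a scatter plot needs one list of x positions and one of y positions.
x_positions = toy_df["height_cm"].tolist()
y_positions = toy_df["weight_kg"].tolist()

print(x_positions)
print(y_positions)
```

The same two lists (or the columns themselves) could then be handed to a plotting function.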
I will be writing this tutorial while looking at the Titanic dataset. The Titanic dataset is a list of passengers, information about them, and whether or not they survived.
I have made the data easy to get: ::

    from urllib import request
    import pandas as pd

    filepath = 'https://gist.githubusercontent.com/braingineer/5d15057ac482ee0130b6d0e6f9cc9311/raw/d4eefaecc98b342ec578cf3512184556e8856750/titanic.csv'

    # Download the CSV and load it into a DataFrame
    response = request.urlopen(filepath)
    df = pd.read_csv(response)

    # Replace missing values with 0 (so a missing age shows up as Age == 0)
    df = df.fillna(0)

Selecting a column ::

    age_column = df['Age']

Selecting a subset ::

    # Keep only the rows with a recorded age (missing ages were filled with 0)
    df2 = df[age_column > 0]

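The subset selection above works through a boolean mask: the comparison produces one True/False value per row, and indexing with that mask keeps only the True rows. A minimal sketch on made-up rows (the names and ages are invented for illustration):

```python
import pandas as pd

# Made-up rows; 0 stands in for a missing age, as it would after fillna(0).
toy_df = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Age": [22, 0, 35, 0]})

mask = toy_df["Age"] > 0   # boolean Series: one True/False per row
known_age = toy_df[mask]   # keeps only the rows where the mask is True

print(mask.tolist())
print(known_age)
```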
View the columns ::

    print(df2.columns)

Visualize a scatter plot ::

    import matplotlib.pyplot as plt

    plt.scatter(df2['Survived'], df2['Age'])

    # or with the columns pulled out first
    surv_col = df2['Survived']
    age_col = df2['Age']
    plt.scatter(surv_col, age_col)

If you don't already have seaborn, install it by running the following in a single cell of your Jupyter Notebook: ::

    !pip install seaborn

Then, you can do the following: ::

    import seaborn as sns

    sns.barplot(data=df, x='Pclass', y='Survived')

You can see more examples of seaborn plots at the seaborn website.
Some examples to get you started: ::

    sns.countplot(data=df, x='Sex', hue='Survived')

    ### do these in different cells, otherwise they will try to plot on top of each other
    ### (older seaborn releases call catplot "factorplot")
    sns.catplot(data=df, x='Pclass', y='Age', col='Sex',
                kind='swarm', hue='Survived', order=[1, 2, 3])

To use data for science, you want to summarize what happened. In other words, you want to tell a story with the data. To do this, you have to look at its different properties: counts, means, proportions, etc.
A good way to formulate a scientific question is to think about different groups. If the rate at which something happens differs between two groups, then there is an effect of group on that rate.
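One way to compare rates between groups is a group-by followed by a mean, because the mean of a 0/1 column is the proportion of 1s. This is a minimal sketch on made-up data; the column names ``group`` and ``outcome`` are hypothetical.

```python
import pandas as pd

# Made-up data: outcome is 1 if the event happened, 0 if it did not.
toy_df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "outcome": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rate of the outcome within each group.
rates = toy_df.groupby("group")["outcome"].mean()

# A nonzero difference suggests the rate depends on the group.
difference = rates["a"] - rates["b"]

print(rates)
print(difference)
```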
- Proportion: a number between 0 and 1 that expresses a part-to-whole relationship.

  - If you eat half of a cake, the proportion you ate is 0.5.

- Percentage: a number between 0 and 100 that expresses a part-to-whole relationship.

  - If you eat half of a cake, the percentage you ate is 50%.
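The cake example can be written out as arithmetic. This sketch assumes a cake cut into 8 slices, 4 of them eaten (the slice counts are made up), and shows that pandas can produce the same proportion with ``value_counts(normalize=True)``.

```python
import pandas as pd

# Assumed for illustration: 8 slices in total, 4 of them eaten.
slices = pd.Series(["eaten"] * 4 + ["left"] * 4)

# Proportion: part divided by whole, between 0 and 1.
proportion = (slices == "eaten").sum() / len(slices)

# Percentage: the same relationship on a 0 to 100 scale.
percentage = proportion * 100

# pandas can compute the proportion of every category at once.
shares = slices.value_counts(normalize=True)

print(proportion, percentage)
print(shares)
```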

Some questions to explore with the Titanic dataset:

- How many people were on the Titanic?
- What percentage of the passengers did not survive?
- How many of the passengers were male? How many were female?
- How many male passengers survived? How many female? Is there an interesting relationship?
- What is the proportion of 3rd class passengers who survived?
- Does passenger class affect the survival rate of each gender?
- What is the mean age per class?
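The pandas operations that questions like these call for can be sketched on a handful of made-up rows. Only the column names mirror the tutorial's dataset; the values are invented, so the numbers below say nothing about the real passengers.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the dataset.
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Pclass": [1, 1, 3, 3, 3, 1],
    "Survived": [0, 1, 0, 1, 1, 1],
    "Age": [40, 30, 20, 10, 30, 50],
})

n_rows = len(toy_df)                                       # how many rows
sex_counts = toy_df["Sex"].value_counts()                  # count per category
pct_not_survived = (toy_df["Survived"] == 0).mean() * 100  # percentage of 0s
mean_age_by_class = toy_df.groupby("Pclass")["Age"].mean() # mean per group

# Grouping by two columns gives one rate per combination of the two.
rate_by_class_sex = toy_df.groupby(["Pclass", "Sex"])["Survived"].mean()

print(n_rows)
print(sex_counts)
print(pct_not_survived)
print(mean_age_by_class)
print(rate_by_class_sex)
```

Swapping ``toy_df`` for the real DataFrame would be the natural next step.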
A version I was working on that renames a column and cleans the dataset: ::

    from urllib import request
    import pandas as pd
    import seaborn as sns
    %matplotlib inline

    filepath = 'https://gist.githubusercontent.com/braingineer/5d15057ac482ee0130b6d0e6f9cc9311/raw/d4eefaecc98b342ec578cf3512184556e8856750/titanic.csv'
    response = request.urlopen(filepath)
    df = pd.read_csv(response)
    df = df.fillna(0)

    # Rename the 'Pclass' column to 'Class'
    df = df.rename(columns={'Pclass': 'Class'})

    # Keep only passengers with a recorded age (missing ages were filled with 0)
    df_clean = df[df['Age'] > 0]

And a couple of extra plots I was looking at: ::

    ### super fancy
    sns.catplot(data=df_clean, kind='violin', split=True, inner='stick',
                scale='count', x='Class', y='Age', hue='Survived', col='Sex')

    ### really sad
    sns.catplot(data=df_clean, kind='bar', col='Class', x='SibSp', y='Age',
                hue='Survived', row='Sex')