# Introduction & Univariate analysis

![](https://www.kdnuggets.com/wp-content/uploads/Fig1-Abisiga-top-10-lists-data-science.jpg "Data Science")

## Interesting resources:
- Python Basics
  - https://courses.edx.org/courses/course-v1:IBM+PY0101EN+3T2020/course/ (Modules 1, 2 & 3)
- DataFrame Basics
  - https://campus.datacamp.com/courses/introduction-to-data-science-in-python/loading-data-in-pandas?ex=1
  - https://campus.datacamp.com/courses/pandas-foundations/data-ingestion-inspection?ex=1
  - https://www.coursera.org/learn/python-data-analysis?specialization=data-science-python#syllabus

## Intro Python, Pandas and DataFrames

In [3]:
import pandas as pd

In [4]:
import seaborn as sns

In [None]:
iris = sns.load_dataset("iris")

In [None]:
iris

In [None]:
type(iris)

In [None]:
iris.head()

![](https://miro.medium.com/max/1000/1*Hh53mOF4Xy4eORjLilKOwA.png "Iris dataset") 

In [None]:
iris.columns

In [None]:
iris.dtypes

In [None]:
print(iris.shape)
print(len(iris))
print(len(iris.columns))

In [None]:
iris['sepal_length']

In [None]:
type(iris['sepal_length'])

In [None]:
iris[['sepal_length','sepal_width']]

In [None]:
columns = ['sepal_length','sepal_width']
iris[columns]

In [None]:
type(columns)

In [None]:
iris.loc[0:4] # inclusive! 0:4 are treated as labels, not indices!

[Good explanation about pandas](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c)

In [None]:
iris.loc[0:4, columns]
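
The difference between label-based and position-based slicing can be sketched on a small made-up DataFrame (the names `df` and `df2` and their values are illustrative only, not part of the iris dataset). `.loc` includes the end label, while `.iloc` follows the usual Python convention and excludes the end position.

```python
import pandas as pd

# Small made-up frame with the default integer index 0..5
df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]})

by_label = df.loc[0:4]      # label-based: rows labelled 0, 1, 2, 3 and 4
by_position = df.iloc[0:4]  # position-based: rows at positions 0, 1, 2 and 3

print(len(by_label))     # 5
print(len(by_position))  # 4

# With a non-integer index the difference is easier to see
df2 = df.set_index(pd.Index(list("abcdef")))
print(df2.loc["a":"c"])  # includes the row labelled "c"
print(df2.iloc[0:2])     # rows "a" and "b" only
```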

In [None]:
versicolorFilter = iris['species'] == 'versicolor'
iris[versicolorFilter].head()
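
Boolean filters like the one above can also be combined. The sketch below uses a tiny made-up DataFrame (`flowers`, with invented values) rather than the real iris data. Note that pandas uses `&`, `|` and `~` instead of `and`, `or` and `not`.

```python
import pandas as pd

# Made-up rows for illustration; not the real iris measurements
flowers = pd.DataFrame({
    "species": ["setosa", "versicolor", "versicolor", "virginica"],
    "sepal_length": [5.1, 7.0, 5.5, 6.3],
})

is_versicolor = flowers["species"] == "versicolor"  # boolean Series
is_long = flowers["sepal_length"] > 6.0             # boolean Series

both = flowers[is_versicolor & is_long]    # rows where both conditions hold
either = flowers[is_versicolor | is_long]  # rows where at least one holds
not_versicolor = flowers[~is_versicolor]   # rows where the condition is False

print(both)
print(either)
print(not_versicolor)
```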

## Univariate analysis

Univariate: look at one column at a time.<br>
The type of univariate analysis that can be performed depends on the type of data in the column.
![](https://images.ctfassets.net/4e8xy1krjypg/A6Xf1MfISZhiQWuyGFDpV/b48be1afb29fcef49f596810281ba226/PillarPage-Qual-Quan-3.svg)
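
One possible way to find out which columns are numerical and which are not is `select_dtypes`. This is a minimal sketch on a made-up DataFrame; treating every non-numerical column as categorical is an assumption that will not hold for every dataset (free text and dates are not categorical, and numerical codes sometimes are).

```python
import pandas as pd

# Made-up frame with the same column names as part of the iris dataset
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica"],
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
})

# Columns with a numerical dtype (int, float, ...)
numerical_columns = df.select_dtypes(include="number").columns.tolist()
# Everything else; assumed here to be categorical
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()

print(numerical_columns)
print(categorical_columns)
```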

### Univariate analysis: Categorical data

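Categorical data takes one of a limited set of discrete values, much like an enum. In the iris dataset, `species` is the categorical column.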

In [None]:
iris['species']

In [None]:
iris['species'].unique()

In [None]:
iris['species'].value_counts()

In [None]:
iris['species'].value_counts().plot(kind='bar')
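
Besides absolute counts, relative frequencies and the most common value are often useful for a categorical column. A minimal sketch on a made-up Series (the values are illustrative, not the real iris counts):

```python
import pandas as pd

# Made-up categorical values for illustration
species = pd.Series(["setosa", "setosa", "versicolor", "virginica"])

counts = species.value_counts()                # absolute count per category
shares = species.value_counts(normalize=True)  # fraction per category, sums to 1
most_common = species.mode()                   # most frequent value(s), as a Series

print(counts)
print(shares)
print(most_common)
```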

### Portfolio assignment 3
15 min: Perform a univariate analysis on all the categorical data of the penguins dataset. Commit the notebook to your portfolio when you're finished.<br>
Optional: Start working on portfolio assignment 4.

In [None]:
penguins = sns.load_dataset("penguins")

In [None]:
penguins.head()

In [None]:
penguins['sex'].value_counts(dropna=False).plot(kind='bar')
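
By default `value_counts` leaves out missing values, which is why `dropna=False` is passed above. The effect can be sketched on a made-up Series (the name `sex` and its values are illustrative only, not the real penguins data):

```python
import pandas as pd

# Made-up values with one missing entry
sex = pd.Series(["Female", "Male", "Female", None])

counts = sex.value_counts()                          # missing values are ignored
counts_with_missing = sex.value_counts(dropna=False)  # missing values get their own row
n_missing = sex.isna().sum()                         # number of missing values

print(counts)
print(counts_with_missing)
print(n_missing)
```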

![](https://i.imgur.com/0v1CGNV.png)

### Portfolio assignment 4
15 min: Look online for a dataset that you personally find interesting to explore. It can be about any topic: sports, games, software development, etc. Commit the dataset to your portfolio. You will be analysing the dataset in future portfolio assignments.

Required characteristics of the dataset:
- Must be in a tabular format: Contains rows and columns
- Contains at least 100 rows
- Contains at least 2 columns with categorical data and at least 2 columns with numerical data
- Is less than 200 MB
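
The row and column requirements above could be checked roughly as follows. `check_dataset` is a hypothetical helper written for this sketch, the toy data is made up, and counting every non-numerical column as categorical is an assumption you should verify by looking at the columns yourself.

```python
import pandas as pd

def check_dataset(df: pd.DataFrame) -> dict:
    """Hypothetical helper: rough check of the row and column requirements."""
    n_numerical = df.select_dtypes(include="number").shape[1]
    # Assumption: every non-numerical column is treated as categorical
    n_categorical = df.shape[1] - n_numerical
    return {
        "at_least_100_rows": len(df) >= 100,
        "at_least_2_numerical": n_numerical >= 2,
        "at_least_2_categorical": n_categorical >= 2,
    }

# Made-up dataset with 120 rows, 2 categorical and 2 numerical columns
toy = pd.DataFrame({
    "team": ["red", "blue"] * 60,
    "position": ["keeper", "defender", "striker"] * 40,
    "goals": list(range(120)),
    "minutes": [90.0] * 120,
})

report = check_dataset(toy)
print(report)
```

The file size limit is not covered by this sketch; it could be checked separately, for example with `os.path.getsize` on the downloaded file.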

![](https://i.imgur.com/0v1CGNV.png)

### Univariate analysis: Numerical data

In [None]:
column = 'sepal_length'
iris[column].min()

In [None]:
iris[column].max()

In [None]:
iris[column].mean()

![](https://danielmiessler.com/images/Mean-Median-Mode-and-Range-e1480829559507.png.webp "Mean")
![](https://cdn.corporatefinanceinstitute.com/assets/arithmetic-mean1-1024x159.png "Mean")

In [None]:
iris[column].median()

![](https://i.pinimg.com/originals/e1/83/9d/e1839de477171534dd55c9bca1d6ace3.png "Median") 

In [None]:
example_column1 = pd.Series([1,2,3,4, 1000])

In [None]:
print(example_column1.mean())
print(example_column1.median())

In [None]:
iris[column].std()

![](https://www.wallstreetmojo.com/wp-content/uploads/2019/05/Standard-Deviation-Formula.jpg "Standard deviation")

In [None]:
example_column1.std()

In [None]:
(((example_column1 - example_column1.mean())**2).sum() / ( len(example_column1)-1))**0.5
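
The manual calculation above divides by `len - 1`, which matches the pandas default (`ddof=1`, the sample standard deviation). NumPy's `np.std` defaults to dividing by `len` (`ddof=0`, the population standard deviation), so the two libraries can give different answers for the same data. A small sketch of the difference, with variable names chosen for this example:

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4, 1000])

sample_std = values.std()              # pandas default: ddof=1, divide by n - 1
population_std = values.std(ddof=0)    # divide by n
numpy_std = np.std(values.to_numpy())  # NumPy default: ddof=0, divide by n

squared_deviations = ((values - values.mean()) ** 2).sum()
manual_sample = (squared_deviations / (len(values) - 1)) ** 0.5
manual_population = (squared_deviations / len(values)) ** 0.5

print(sample_std, manual_sample)
print(population_std, numpy_std, manual_population)
```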

In [None]:
example_column2 = pd.Series([1,1,1,1,1,1])
example_column3 = pd.Series([1,2,3,4,5,6])
example_column4 = pd.Series([1,2,3,4,5,100])
example_column5 = pd.Series([10,20,30,40,50,60])

In [None]:
(example_column2.std(), example_column3.std(), example_column4.std(), example_column5.std())

In [None]:
iris[column].plot(kind='hist', bins = 10)

In [None]:
iris[column].plot(kind='box')

In [None]:
example_column4.plot(kind='box')

In [None]:
# remove the outlier
example_column4[example_column4 < 10].plot(kind='box') 

![](https://miro.medium.com/max/1400/1*2c21SkzJMf3frPXPAR_gZA.png)

![](https://naysan.ca/wp-content/uploads/2020/06/box_plot_ref_needed.png)
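
The numbers behind a box plot can also be computed directly. The sketch below assumes the common 1.5 × IQR rule for the whiskers (the matplotlib default) and uses the same values as `example_column4`. Instead of the hard-coded threshold `< 10`, it derives the outlier limits from the quartiles.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 100])

q1 = values.quantile(0.25)  # first quartile (bottom of the box)
q3 = values.quantile(0.75)  # third quartile (top of the box)
iqr = q3 - q1               # interquartile range (height of the box)

# Assumption: points more than 1.5 * IQR outside the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
without_outliers = values[(values >= lower_fence) & (values <= upper_fence)]

print(q1, q3, iqr)
print(outliers.tolist())
```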

### Portfolio assignment 5
20 min: 
- Download lifeExpectancyAtBirth.csv from Brightspace ([original source](https://www.kaggle.com/utkarshxy/who-worldhealth-statistics-2020-complete?select=lifeExpectancyAtBirth.csv)).
- Move the file to the same folder as the Notebook that you will be working in.
- Load the dataset in your Notebook with the following code: `lifeExpectancy = pd.read_csv('lifeExpectancyAtBirth.csv', sep=',')`
- Look at the dataset with the `.head()` function.
- Filter the dataframe: we only want the life expectancy data for 2019 and 'Both sexes'.
- Use this dataframe to perform a univariate analysis on the life expectancy in 2019.
- Which five countries have the highest life expectancy? Which five the lowest?

Commit the notebook and dataset to your portfolio when you're finished.
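
The filtering and ranking steps can be sketched on a made-up DataFrame. The column names below follow the CSV, but the locations and values are invented, so this only shows the technique and not the answer. Each comparison needs its own parentheses, because `&` binds more tightly than `==`.

```python
import pandas as pd

# Made-up rows; locations "A", "B", "C" and all values are illustrative
toy = pd.DataFrame({
    "Location": ["A", "A", "B", "B", "C", "C"],
    "Period": [2019, 2015, 2019, 2019, 2019, 2019],
    "Dim1": ["Both sexes", "Both sexes", "Both sexes", "Male", "Both sexes", "Female"],
    "First Tooltip": [70.0, 68.0, 80.0, 78.0, 60.0, 62.0],
})

# Combine two conditions with & and keep only the matching rows
mask = (toy["Period"] == 2019) & (toy["Dim1"] == "Both sexes")
selected = toy[mask]

# Rows with the largest and smallest values in one column
highest = selected.nlargest(2, "First Tooltip")
lowest = selected.nsmallest(2, "First Tooltip")

print(highest[["Location", "First Tooltip"]])
print(lowest[["Location", "First Tooltip"]])
```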

In [10]:
lifeExpectancy = pd.read_csv('lifeExpectancyAtBirth.csv', sep=',')

In [11]:
lifeExpectancy.head()

| Unnamed: 0 | Location | Period | Indicator | Dim1 | First Tooltip |
|---|---|---|---|---|---|
| 0 | Afghanistan | 2019 | Life expectancy at birth (years) | Both sexes | 63.21 |
| 1 | Afghanistan | 2019 | Life expectancy at birth (years) | Male | 63.29 |
| 2 | Afghanistan | 2019 | Life expectancy at birth (years) | Female | 63.16 |
| 3 | Afghanistan | 2015 | Life expectancy at birth (years) | Both sexes | 61.65 |
| 4 | Afghanistan | 2015 | Life expectancy at birth (years) | Male | 61.04 |


In [16]:
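# Compare with == (a single = assigns). The original line also overwrote every
# value in 'Period' with 2019, so re-run the read_csv cell before this one.
filter1 = lifeExpectancy['Period'] == 2019
lifeExpectancy[filter1]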

![](https://i.imgur.com/0v1CGNV.png)

### Portfolio assignment 6
60 min: Perform a univariate analysis on at least 2 columns with categorical data and on at least 2 columns with numerical data in the dataset that you chose in portfolio assignment 4. Commit the Notebook to your portfolio when you're finished.

![](https://i.imgur.com/0v1CGNV.png)