# Introduction & Univariate analysis

![](https://www.kdnuggets.com/wp-content/uploads/Fig1-Abisiga-top-10-lists-data-science.jpg "Data Science")

## Interesting resources: 
- Python Basics
 - https://app.datacamp.com/learn/courses/intro-to-python-for-data-science
- DataFrame Basics
 - https://campus.datacamp.com/courses/introduction-to-data-science-in-python/loading-data-in-pandas?ex=1
 - https://campus.datacamp.com/courses/pandas-foundations/data-ingestion-inspection?ex=1 


## Intro Python, Pandas and DataFrames

**Pandas**
Pandas is a powerful data manipulation and analysis library for Python. It provides data structures like Series (one-dimensional) and DataFrame (two-dimensional) that make it easy to handle and analyze structured data. Pandas is particularly useful for data cleaning, transformation, and analysis tasks.

**Seaborn**
Seaborn is a Python data visualization library for drawing attractive and informative statistical graphics. Seaborn makes it easy to create complex visualizations with just a few lines of code, and it integrates well with Pandas DataFrames.

**DataFrames**
A DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns). Think of it as a table or a spreadsheet in Python. Each column in a DataFrame can be of a different data type (e.g., integers, floats, strings).
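
To make the "table with typed columns" idea concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The column names and values are made up for illustration and are not part of any real dataset.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
flowers = pd.DataFrame({
    "name": ["rose", "tulip", "daisy"],   # text column
    "petal_count": [5, 6, 21],            # integer column
    "height_cm": [45.0, 30.5, 12.2],      # float column
})

print(flowers)          # the table itself, with row labels 0, 1, 2
print(flowers.dtypes)   # one data type per column
print(flowers.shape)    # (number of rows, number of columns)
```

Each dictionary key becomes a column, and every column keeps its own data type.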

**Why DataFrames are Useful**
- **Ease of Use:** DataFrames provide a simple and intuitive way to manipulate and analyze data.
- **Data Cleaning:** They offer powerful tools for handling missing data, filtering, and transforming data.
- **Integration:** DataFrames integrate seamlessly with other libraries like Seaborn for visualization and NumPy for numerical operations.
- **Performance:** Pandas is built on NumPy and runs column-wise operations in optimized compiled code, which makes it efficient for datasets that fit in memory.

Import the libraries Pandas and Seaborn.
You need to install these libraries if they’re not already included in your Anaconda environment.

Here’s how you can check and install them:

In [None]:
import pandas as pd
import seaborn as sns

If there’s no error, you're good to go!
If an error occurs, run the following command in a terminal:
```conda install pandas seaborn```
This installs both libraries using conda.
Then rerun the import cell above.

The line ```iris = sns.load_dataset("iris")``` uses the Seaborn library to load the "iris" example dataset into the variable iris. Seaborn downloads its example datasets from an online repository on first use, so an internet connection is required. This dataset contains measurements of 150 iris flowers from three species, including sepal and petal dimensions. It's a classic dataset often used for data visualization, statistical analysis, and machine learning examples.

In [None]:
iris = sns.load_dataset("iris")

In [None]:
iris

In [None]:
type(iris)

The following command displays the first five rows of the iris DataFrame. It's a quick way to get an overview of the data, including the column names and the first few entries. This is often used to inspect the structure and content of a dataset right after loading it.

In [None]:
iris.head()

![](https://miro.medium.com/max/1000/1*Hh53mOF4Xy4eORjLilKOwA.png "Iris dataset") 

In [None]:
iris.columns

In [None]:
iris.dtypes

In [None]:
print(iris.shape)
print(len(iris))
print(len(iris.columns))

In [None]:
iris['sepal_length']

In [None]:
type(iris['sepal_length'])

In [None]:
iris[['sepal_length','sepal_width']]

In [None]:
columns = ['sepal_length','sepal_width']
iris[columns]

In [None]:
type(columns)
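
The cells above show two ways of selecting columns. As a small sketch on made-up data (the column names `a`, `b`, `c` are hypothetical): a single string key returns a Series, while a list of keys returns a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

one_column = df["a"]           # string key      -> Series
sub_table = df[["a", "b"]]     # list of keys    -> DataFrame
one_col_table = df[["a"]]      # list of one key -> DataFrame with one column

print(type(one_column).__name__)
print(type(sub_table).__name__)
print(type(one_col_table).__name__)
```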

The .loc[] indexer in pandas selects rows and columns by their labels. For example, iris.loc[0:4] returns the rows with labels 0 through 4, including 4. Unlike regular Python slicing, .loc[] includes the end label because it slices by label rather than by position.

In [None]:
iris.loc[0:4] # inclusive! 0:4 are treated as labels, not indices!

[Good explanation about pandas](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c)

In [None]:
iris.loc[0:4, columns]
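
To see the difference between label-based and position-based slicing, the sketch below compares `.loc[]` with `.iloc[]` (the position-based counterpart) on a small made-up DataFrame. The column name `value` is hypothetical.

```python
import pandas as pd

# Hypothetical data; the default row labels are 0, 1, 2, 3, 4.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_label = df.loc[0:2]       # labels 0, 1 and 2 -> 3 rows (end label included)
by_position = df.iloc[0:2]   # positions 0 and 1 -> 2 rows (end position excluded)
print(len(by_label), len(by_position))

# After sorting, labels and positions no longer coincide.
shuffled = df.sort_values("value", ascending=False)   # row labels are now 4, 3, 2, 1, 0
print(shuffled.loc[4:2])     # from label 4 up to and including label 2
print(shuffled.iloc[0:2])    # the first two rows, whatever their labels are
```

As long as the row labels are the default 0, 1, 2, ..., labels and positions look the same, which is why the inclusive end of `.loc[]` is easy to overlook.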

In [None]:
versicolorFilter = iris['species'] == 'versicolor'
iris[versicolorFilter].head()
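
The filter above works in two steps: the comparison produces a Series of True/False values (a boolean mask), and indexing with that mask keeps only the rows marked True. A minimal sketch on a few made-up rows, which also shows how two conditions could be combined with `&`:

```python
import pandas as pd

# Hypothetical rows, invented purely for illustration.
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "versicolor", "virginica"],
    "sepal_length": [5.1, 7.0, 5.5, 6.3],
})

is_versicolor = df["species"] == "versicolor"   # boolean Series, one value per row
print(is_versicolor.tolist())

versicolor_rows = df[is_versicolor]             # keeps only the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
long_versicolor = df[is_versicolor & (df["sepal_length"] > 6.0)]
print(long_versicolor)
```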

## Univariate analysis

Univariate: Look at one column at a time.<br>
The type of univariate analysis that can be performed depends on the type of data in the column.
![](https://images.ctfassets.net/4e8xy1krjypg/A6Xf1MfISZhiQWuyGFDpV/b48be1afb29fcef49f596810281ba226/PillarPage-Qual-Quan-3.svg)
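
One possible way to let the data type decide the analysis is to split the columns by dtype first. The sketch below assumes that non-numeric columns can be treated as categorical, which holds for simple datasets like iris but not in general (a numeric code such as a postal code is categorical too). The data is made up.

```python
import pandas as pd

# Hypothetical rows, invented purely for illustration.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3],
})

numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

for name in categorical:
    print(df[name].value_counts())   # categorical: count how often each value occurs

for name in numerical:
    print(df[name].describe())       # numerical: min, max, mean, quartiles, ...
```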

### Univariate analysis: Categorical data

Categorical data takes a limited set of discrete values, similar to an enum.

In [None]:
iris['species']

In [None]:
iris['species'].unique()

In [None]:
iris['species'].value_counts()

In [None]:
iris['species'].value_counts().plot(kind='bar')
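
Besides absolute counts, `value_counts()` can also report proportions and include missing values. A short sketch on a made-up Series (the values are hypothetical, with one missing entry added on purpose):

```python
import pandas as pd

# Hypothetical values with one missing entry.
species = pd.Series(["setosa", "setosa", "versicolor", "virginica", None])

counts = species.value_counts()                    # missing values are excluded by default
shares = species.value_counts(normalize=True)      # proportions instead of counts
with_missing = species.value_counts(dropna=False)  # also counts the missing entry

print(counts)
print(shares)
print(with_missing)
print(species.nunique())   # number of distinct non-missing values
```

Checking `dropna=False` is useful when a column may contain gaps, as in the penguins dataset used in the assignment below.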

### Portfolio assignment 3
15 min: Perform a univariate analysis on all the categorical data of the penguins dataset. Commit the notebook to your portfolio when you're finished.
Optional: Start working on portfolio assignment 4 

In [None]:
penguins = sns.load_dataset("penguins")

In [None]:
penguins.head()

In [None]:
penguins['island'].value_counts(dropna=True).plot(kind='bar')

penguins['island'].value_counts(dropna=True).plot(kind='line')

### Portfolio assignment 4
15 min: Look online for a dataset that you personally find interesting to explore. It can be about any topic that you find interesting: sports, games, software development, etc. Commit the dataset to your portfolio. You will be analysing the dataset in future portfolio assignments.

Required characteristics of the dataset:
- Must be in a tabular format: Contains rows and columns
- Contains at least 100 rows
- Contains at least 2 columns with categorical data and at least 2 columns with numerical data
- Is circa 200 MB (due to restrictions of Brightspace)

Need help finding a dataset? Reach out to your teacher, who has a list of spare datasets.

![](https://i.imgur.com/0v1CGNV.png) <br>
Findings: ...<br>

### Univariate analysis: Numerical data

In [None]:
column = 'sepal_length'
iris[column].min()

In [None]:
iris[column].max()

![](https://d1e4pidl3fu268.cloudfront.net/233a08fc-072d-4601-9950-9ae9a6da8d7a/MeanmedianmdoeandrangepostersFrontcover.png)

In [None]:
iris[column].mean()

In [None]:
iris[column].median()

In [None]:
example_column1 = pd.Series([1,2,3,4, 1000])

This creates a Pandas Series, which is basically a 1-dimensional labeled array. It looks and behaves a bit like a column in a DataFrame.
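
The "labeled" part means that every value in a Series has an index label. A small sketch, where the custom labels `first` and `second` are made up for illustration:

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 1000])
print(numbers.index.tolist())   # default labels: 0, 1, 2, 3, 4

# Hypothetical custom labels instead of the default ones.
labeled = pd.Series([5.1, 4.9], index=["first", "second"])
print(labeled["second"])        # look up a value by its label
```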

In [None]:
print("example_column1:")
print(example_column1)
print("Mean: ",example_column1.mean())
print("Median: ", example_column1.median())
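
The example above illustrates why the median is considered robust: a single extreme value pulls the mean far away, while the median barely moves. As a sketch, compare the same values with and without the outlier:

```python
import pandas as pd

with_outlier = pd.Series([1, 2, 3, 4, 1000])
without_outlier = with_outlier[with_outlier < 1000]   # drop the extreme value

print("With outlier:    mean =", with_outlier.mean(), " median =", with_outlier.median())
print("Without outlier: mean =", without_outlier.mean(), " median =", without_outlier.median())
```

With the outlier the mean is (1 + 2 + 3 + 4 + 1000) / 5 = 202, although four of the five values are below 5.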

In [None]:
iris[column].std() #column = 'sepal_length'

![](https://www.wallstreetmojo.com/wp-content/uploads/2019/05/Standard-Deviation-Formula.jpg "Standard deviation")

In [None]:
example_column1.std()

But you can also do the calculation yourself:

In [None]:
# calculate std manually
(((example_column1 - example_column1.mean())**2).sum() / (len(example_column1) - 1))**0.5
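
The manual calculation divides by n - 1, which gives the sample standard deviation. This is what `.std()` in pandas computes by default (`ddof=1`). Dividing by n instead gives the population standard deviation, available through `ddof=0`. A sketch comparing both:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 1000])
n = len(values)

squared_deviations = ((values - values.mean()) ** 2).sum()

sample_std = (squared_deviations / (n - 1)) ** 0.5   # divide by n - 1
population_std = (squared_deviations / n) ** 0.5     # divide by n

print(sample_std, values.std())             # pandas default: ddof=1
print(population_std, values.std(ddof=0))   # population version
```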

In [None]:
example_column2 = pd.Series([1,1,1,1,1,1])
example_column3 = pd.Series([1,2,3,4,5,6])
example_column4 = pd.Series([1,2,3,4,5,100])
example_column5 = pd.Series([10,20,30,40,50,60])

In [None]:
# calculate the standard deviation of each of the columns
(example_column2.std(), example_column3.std(), example_column4.std(), example_column5.std()) 
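
The results above follow from how the standard deviation reacts to changes in the data. The sketch below checks three properties on the same kind of values: a constant column has no spread, multiplying every value by 10 multiplies the standard deviation by 10, and adding a constant does not change it.

```python
import pandas as pd

constant = pd.Series([1, 1, 1, 1, 1, 1])
base = pd.Series([1, 2, 3, 4, 5, 6])
scaled = base * 10      # same values as example_column5
shifted = base + 100    # same spread, moved to a different location

print(constant.std())   # no variation at all
print(base.std(), scaled.std(), shifted.std())
```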

In [None]:
# The data is grouped into 10 intervals (bins) and the number of data points in each interval is counted.
iris[column].plot(kind='hist', bins=10)
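
To inspect the numbers behind a histogram, the binning can also be done without plotting. The sketch below uses `pd.cut` on made-up values with 4 bins. Note that `pd.cut` treats values that fall exactly on a bin edge slightly differently from the plotting code, so the counts may not always match the histogram exactly.

```python
import pandas as pd

# Hypothetical measurements, invented purely for illustration.
values = pd.Series([4.3, 4.9, 5.0, 5.1, 5.8, 6.3, 6.4, 6.9, 7.7, 7.9])

binned = pd.cut(values, bins=4)           # assign each value to one of 4 equal-width intervals
counts = binned.value_counts(sort=False)  # keep the intervals in ascending order

print(counts)
```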

The following cell creates a boxplot (a.k.a. box-and-whisker plot) for one column of the iris dataset.
A boxplot is a standardized way of visualizing the distribution of a dataset, highlighting:
- Median (middle value)
- Quartiles (25th and 75th percentiles)
- Whiskers (range excluding outliers)
- Outliers (extreme values, shown as dots)

An outlier is a data point that is significantly different from the rest of the data.
It "sticks out" because it's much higher or lower than the majority of values.

Imagine you have these values (e.g., test scores):
```[80, 82, 78, 85, 79, 81, 400]```

Most scores are around 80... but then there's 400 — that’s an **outlier**.
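
A boxplot needs a rule to decide which points count as outliers. The sketch below assumes the common 1.5 × IQR rule, which Matplotlib (the library pandas uses for plotting) applies by default: a point is an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile.

```python
import pandas as pd

scores = pd.Series([80, 82, 78, 85, 79, 81, 400])

q1 = scores.quantile(0.25)   # first quartile (25th percentile)
q3 = scores.quantile(0.75)   # third quartile (75th percentile)
iqr = q3 - q1                # interquartile range: the height of the box

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers.tolist())
```

Other rules exist, so treat the factor 1.5 as a convention rather than a definition.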

In [None]:
iris[column].plot(kind='box')

In [None]:
example_column4.plot(kind='box')

In [None]:
# remove the outlier
example_column4[example_column4 < 10].plot(kind='box') 

![](https://miro.medium.com/max/1400/1*2c21SkzJMf3frPXPAR_gZA.png)

### Portfolio assignment 5
20 min: 
- Download lifeExpectancyAtBirth.csv from Brightspace ([original source](https://www.kaggle.com/utkarshxy/who-worldhealth-statistics-2020-complete?select=lifeExpectancyAtBirth.csv)).
- Move the file to the same folder as the Notebook that you will be working in.
- Load the dataset in your Notebook with the following code: lifeExpectancy = pd.read_csv('lifeExpectancyAtBirth.csv', sep=',')
- Look at the dataset with the .head() function.
- Filter the DataFrame: we only want the life expectancy data for 2019 and 'Both sexes'.
- Use this dataframe to perform a univariate analysis on the life expectancy in 2019.
- Which five countries have the highest life expectancy? Which five the lowest?

Commit the notebook and dataset to your portfolio when you're finished.

![](https://i.imgur.com/0v1CGNV.png) <br>
Findings: ...<br>

### Portfolio assignment 6
60 min: Perform a univariate analysis on at least 2 columns with categorical data and on at least 2 columns with numerical data in the dataset that you chose in portfolio assignment 4. For each analysis, write down an assumption beforehand and your findings afterwards. Commit the Notebook to your portfolio when you're finished.

![](https://i.imgur.com/0v1CGNV.png)<br>
Assumption: ...<br>
Finding: ...<br>
