<img src="images/lasalle_logo.png" style="width:375px;height:110px;">

# Week 8 - CSV files 

### WIM250 - Introduction to Scripting Languages 
### Instructor: Ivaldo Tributino

Sources:
    
- Automate the Boring Stuff with Python by Al Sweigart.
- Python for Everybody: Exploring Data Using Python 3 by Dr. Charles R. Severance.
- Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.

## The CSV Module

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.
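
As a small illustration of the format described above, the sketch below parses a CSV string held in memory instead of a file. The text in `csv_text` is made up for this example. It also shows a common convention: a field that itself contains a comma is wrapped in double quotes so that it still counts as a single field.

```python
import csv
import io

# Hypothetical CSV text: one header line and two records with three fields each.
# The city in the last record contains a comma, so it is wrapped in double quotes.
csv_text = 'name,city,age\nAlice,Montreal,30\nBob,"Quebec City, QC",25\n'

# io.StringIO lets csv.reader treat the string like an open text file.
rows = list(csv.reader(io.StringIO(csv_text)))

for row in rows:
    print(row)
```

Every record is returned as a list with the same number of fields, and the quoted city stays in one piece.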

<img src="images/csv.png" style="width:500px;height:300px;">

Python’s `csv` module makes it easy to parse CSV files. We can also use a powerful library called pandas for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Let's take a look at the file named `example.csv`. We will import the csv module, then open and read the file.

In [None]:
import csv

In [None]:
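# The with statement closes the file automatically once the rows have been read.
with open('example.csv', newline='') as exampleFile:  # exampleFile is a TextIOWrapper
    exampleReader = csv.reader(exampleFile)           # csv.reader object
    exampleData = list(exampleReader)                 # every row becomes a list of strings
exampleData                                           # list of lists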

Now that you have the CSV file as a list of lists, you can access the value at a particular row and column with the expression `exampleData[row][col]`.
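
The indexing can be tried without the file as well. The sketch below uses a hand-written list of lists, `sample_data`, as a stand-in for what `list(exampleReader)` returns; the rows are assumptions made for illustration. Note that the csv module returns every value as a string, so numbers have to be converted before doing arithmetic.

```python
# Hypothetical rows standing in for the contents of example.csv.
sample_data = [
    ['4/5/2015 13:34', 'Apples', '73'],
    ['4/5/2015 3:41', 'Cherries', '85'],
    ['4/6/2015 12:46', 'Pears', '14'],
]

fruit = sample_data[2][1]          # third row, second column (indexes start at 0)
quantity = int(sample_data[2][2])  # values are strings, so convert before calculating

print(fruit, quantity + 1)
```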

In [None]:
exampleData[2][1] == 'Pears'  # third row, second column (indexes start at 0)

After you import the csv module and make a reader object from the CSV file, you can loop through the rows in the reader object. Each row is a list of values, with each value representing a cell.

In [None]:
with open('example.csv', newline='') as exampleFile:  # TextIOWrapper
    exampleReader = csv.reader(exampleFile)           # csv.reader

    print(type(exampleReader))

    # line_num is the number of the line most recently read from the file.
    for row in exampleReader:
        print('Row #' + str(exampleReader.line_num) + ' ' + str(row))

### Pandas 

`pandas` is a software library for data manipulation and analysis, written for the Python programming language. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data".

<img src="images/pandas.png" style="width:300px;height:300px;">

In [None]:
import pandas as pd

In [None]:
df_housing = pd.read_csv('housing.csv') # Data Frame
type(df_housing)

The `df.head(n)` method returns the first `n` rows of the object, based on position. If `n` is omitted, it returns the first 5 rows.
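
A minimal sketch of `head()` on a tiny data frame built from a string, so it runs without `housing.csv`. The values and the name `df_small` are assumptions; only the column names mimic the housing data.

```python
import io
import pandas as pd

# Hypothetical data: four rows with two of the housing column names.
csv_text = (
    "housing_median_age,median_house_value\n"
    "41,452600\n"
    "21,358500\n"
    "52,352100\n"
    "52,341300\n"
)
df_small = pd.read_csv(io.StringIO(csv_text))

first_two = df_small.head(2)  # only the first 2 rows
print(first_two)
print(df_small.shape)         # (number of rows, number of columns)
```

When the data frame has fewer rows than requested, `head()` simply returns all of them.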

In [None]:
df_housing.head() # default 5

In [None]:
print("Total Row Number: {0} \nTotal Columns Number: {1}".format(df_housing.shape[0], df_housing.shape[1]))

The `df.info()` method gives a quick description of the data, in particular the total number of rows and each attribute’s type and number of non-null values.

In [None]:
df_housing.info()

Calling the `df.hist()` method on the whole dataset plots a histogram for each numerical attribute.

In [None]:
# df_housing.hist() # see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html

In [None]:
# Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
import matplotlib.pyplot as plt 
df_housing.hist(bins=50, figsize=(20,15))
plt.show()

Let's call `value_counts()` on one of the columns to get a Series containing the counts of its unique values.
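
Before running this on the full dataset, here is a small sketch on made-up data (`df_toy` and its values are assumptions). A comparison on a column yields a Boolean Series, which can then be used to keep only the matching rows before counting.

```python
import pandas as pd

# Hypothetical toy data reusing two of the housing column names.
df_toy = pd.DataFrame({
    "housing_median_age": [52, 30, 51, 10, 52],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

old_houses = df_toy["housing_median_age"] > 50  # Boolean Series, one value per row
counts = df_toy.loc[old_houses, "ocean_proximity"].value_counts()

print(old_houses.tolist())
print(counts)
```

Only the categories that survive the mask appear in the result, so `INLAND` is absent here.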

In [None]:
# Selecting rows by Boolean indexing
mask = df_housing['housing_median_age']>50 # Boolean masks 
# value_counts() function returns object containing counts of unique values
df_housing[mask]['ocean_proximity'].value_counts()

In [None]:
mask1 = df_housing['median_house_value']>500000
mask2 = df_housing['housing_median_age']>50
df_housing[(mask1) & (mask2)]['ocean_proximity'].value_counts()

The pandas `DataFrame.groupby()` method is used to split the data into groups based on some criteria. Pandas objects can be split on any of their axes.
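
A minimal sketch of the split-then-aggregate idea on made-up data (`df_toy` and its values are assumptions): the rows are grouped by category, and the median is then computed separately inside each group.

```python
import pandas as pd

# Hypothetical toy data: three inland houses and two near the bay.
df_toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "INLAND", "NEAR BAY", "NEAR BAY"],
    "median_house_value": [100000, 120000, 200000, 300000, 500000],
})

# Split by category, select one column, then aggregate each group.
medians = df_toy.groupby("ocean_proximity")["median_house_value"].median()
print(medians)
```

With an even number of values, as in the `NEAR BAY` group, the median is the average of the two middle values.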

In [None]:
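grouped = df_housing.groupby(['ocean_proximity'])  # DataFrameGroupBy object, not a DataFrame
# grouped.head()                                   # first 5 rows of each group
serieGB = grouped['median_house_value']            # SeriesGroupBy: house values for each group
# list(serieGB)                                    # uncomment to inspect the (group, Series) pairs
serieGB.median()   # the median is the middle value of each group's sorted house values
# serieGB.median().plot(kind='bar')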


In [None]:
# Let's join the above commands in a single line.
df_housing.groupby(['ocean_proximity'])['median_house_value'].median().plot(kind='bar')

We often evaluate a property by its location, so let's investigate house prices by location using a `box plot`.
A `box plot` shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.
For more information about box plots, see: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
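
The numbers behind a box plot can be computed by hand. The sketch below uses made-up values and assumes the common convention that a point is an outlier when it lies more than 1.5 times the inter-quartile range beyond the box; plotting libraries may use a different rule or allow it to be configured.

```python
import numpy as np

# Hypothetical values with one obviously extreme point.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles: the edges of the box and the line inside it.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # inter-quartile range (height of the box)

# Assumed rule: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```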

To draw the `box plot`, we will use `Seaborn`. `Seaborn` is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

For more information about `Seaborn`, see: https://seaborn.pydata.org/

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt 
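# Box plot of house value for each ocean_proximity category.
# sns.boxplot draws the quartile box and whiskers described above
# (sns.boxenplot would draw a letter-value plot instead).
plt.subplots(figsize=(12, 5))
sns.boxplot(x='ocean_proximity', y='median_house_value', data=df_housing)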
plt.title('Ocean Proximity vs House Value')
plt.show()

### Looking for Correlations

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation; for example, the median house value tends to go up when the median income goes up. When the coefficient is close to –1, it means that there is a strong negative correlation; you can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north). Finally, coefficients close to zero mean that there is no linear correlation.

<img src="images/correlation.png" style="width:700px;height:200px;">
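
A small numerical sketch of these three cases, using made-up arrays and `np.corrcoef`. The last case shows that a coefficient of zero only rules out a linear relationship: the values follow a clear U shape, yet the coefficient is zero.

```python
import numpy as np

# Hypothetical data for illustration.
income = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
value_up = 2 * income + 1          # rises with income
value_down = -3 * income + 10      # falls as income rises
value_u_shape = (income - 3) ** 2  # symmetric U shape around income = 3

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_up = np.corrcoef(income, value_up)[0, 1]
r_down = np.corrcoef(income, value_down)[0, 1]
r_u_shape = np.corrcoef(income, value_u_shape)[0, 1]

print(round(r_up, 3), round(r_down, 3), round(r_u_shape, 3))
```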

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np 

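# df.corr() computes the pairwise correlation of columns, excluding NA/null values.
# numeric_only=True skips text columns such as ocean_proximity, which recent pandas
# versions would otherwise refuse to correlate (assumes pandas 1.5 or later).
corr = df_housing.corr(numeric_only=True)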
plt.subplots(1,1,figsize=(15,10))
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# sns.heatmap() plots rectangular data as a color-encoded matrix.
# sns.heatmap(corr, annot=True, linewidths=1, cmap="YlGnBu")  # unmasked version
sns.heatmap(corr, annot=True, mask = mask,linewidths=1,cmap="YlGnBu")
plt.show()

### Feature Engineering

`Feature engineering` is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms.

**Let's divide the data into homogeneous subgroups:**
- The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories) and rounding up using `ceil` (to have discrete categories).
- It then merges all the categories greater than 5 into category 5.

**For this task it will be important to know DataFrame.where, so let's take a look at the syntax.**

```
Syntax:
DataFrame.where(cond, other=nan, inplace=False) 
 
Parameters:
cond: One or more conditions to check the data frame against; entries where the condition is True are kept.
other: Value used to replace entries that don’t satisfy the condition. Default is NaN.
inplace: Boolean value. If True, makes the changes in the data frame itself.
```
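
A minimal sketch of the two steps on a handful of made-up incomes (the Series `median_income` and its values are assumptions): divide by 1.5 and round up, then use `where` to keep categories below 5 and replace everything else with 5.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes.
median_income = pd.Series([1.2, 3.0, 4.6, 8.0, 15.0])

# Step 1: divide by 1.5 and round up to get discrete categories.
income_cat = np.ceil(median_income / 1.5)

# Step 2: keep values where the condition holds, replace the others with 5.0.
capped = income_cat.where(income_cat < 5, 5.0)

print(income_cat.tolist())
print(capped.tolist())
```

Note that `where` keeps the entries for which the condition is True, so the condition describes what to preserve, not what to replace.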

In [None]:
import numpy as np

# The ceil of the scalar x is the smallest integer i, such that i >= x.
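df_housing["income_cat"] = np.ceil(df_housing["median_income"]/1.5)  # the max median_income is 15 so 15/1.5=10 

# Keep categories below 5 and replace every other category with 5.0.
# Assigning the result back is more reliable than inplace=True on a selected column.
df_housing["income_cat"] = df_housing["income_cat"].where(df_housing["income_cat"] < 5, 5.0)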
df_housing.head()

In [None]:
# How the income category relates to the median house value.
df_housing.groupby(['income_cat'])['median_house_value'].mean().plot(kind='bar', figsize=(8,5))
plt.title('Income_cat vs House median value')
plt.show()

### Missing Values in Data

It is important to understand the concept of missing values in order to manage data successfully. If researchers do not handle the missing values properly, they may end up making inaccurate inferences about the data.

<img src="images/missing.png" style="width:550px;height:300px;">
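
A small sketch on made-up data (`df_toy` and its values are assumptions) of how missing values can be counted per column. Filling the gaps with the column median is shown as one common option; it is not the only way to handle them and may not suit every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing bedroom counts.
df_toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# isnull() marks each missing entry as True; sum() counts them per column.
missing_per_column = df_toy.isnull().sum()
print(missing_per_column)

# One possible treatment: replace missing values with the column median.
bedrooms_median = df_toy["total_bedrooms"].median()  # NaN values are skipped
filled = df_toy["total_bedrooms"].fillna(bedrooms_median)
print(filled.tolist())
```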

In [None]:
# Create a data frame with the number of missing values in each column.
df_missing = df_housing.isnull().sum(axis=0).reset_index()
df_missing