<img src="images/lasalle_logo.png" style="width:375px;height:110px;">
<p style=  "text-align: right; color: blue;"> WIM250 - Summer 2025</p>

# Week 7 - CSV files 

### WIM250 - Introduction to Scripting Languages 
### Instructor: Ivaldo Tributino

Sources:
    
- Automate the Boring Stuff with Python by Al Sweigart.
- Python for Everybody: Exploring Data Using Python 3 by Dr. Charles R. Severance.
- Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.

## The CSV Module

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

<img src="images/csv.png" style="width:500px;height:300px;">

Python’s `csv` module makes it easy to parse CSV files. We can also use pandas, a powerful library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Let's take a look at the file named `example.csv`. We will import the `csv` module, then open and read the file.

In [None]:
import csv

In [None]:
# Use a context manager to open the CSV file safely
with open('example.csv', mode='r', newline='', encoding='utf-8') as exampleFile:
    # Create a CSV reader object
    exampleReader = csv.reader(exampleFile)

    # Convert the reader object to a list of rows
    exampleData = list(exampleReader)

# Display the entire data read from the CSV file
exampleData

Now that you have the CSV file as a list of lists, you can access the value at a particular row and column with the expression `exampleData[row][col]`.

In [None]:
exampleData[2][1] == 'Pears' # third row, second column (indices start at 0)
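To try the same indexing without the course file, the sketch below parses a small CSV held in memory. The sample rows are invented for illustration and are not the contents of `example.csv`; `io.StringIO` simply lets `csv.reader` treat a string like an open file.

```python
import csv
import io

# Hypothetical sample data standing in for a CSV file on disk
sample_text = "date,fruit,quantity\n4/5/2025,Apples,73\n4/6/2025,Pears,14\n"

# csv.reader accepts any iterable of lines, including an in-memory text buffer
sample_rows = list(csv.reader(io.StringIO(sample_text)))

header = sample_rows[0]      # first row: the column names
cell = sample_rows[2][1]     # third row, second column

print(header)
print(cell)
```

Note that the `csv` module reads every value as a string, so `sample_rows[1][2]` is `'73'` rather than the integer `73`.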

After you import the csv module and make a reader object from the CSV file, you can loop through the rows in the reader object. Each row is a list of values, with each value representing a cell.

In [None]:
# Open the CSV file using a context manager to ensure it's properly closed after reading
with open('example.csv', mode='r', newline='', encoding='utf-8') as exampleFile:
    # Create a CSV reader object
    exampleReader = csv.reader(exampleFile)

    # Print the type of the reader object 
    print(f"Type of exampleReader: {type(exampleReader)}")

    # Iterate over each row in the CSV file
    for row in exampleReader:
        # Print the current line number and the row content
        print(f"Row #{exampleReader.line_num}: {row}")


### Pandas 

In computer programming, `pandas` is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data".

<img src="images/pandas.png" style="width:300px;height:300px;">

In [None]:
# Import pandas under its conventional alias
import pandas as pd


In [None]:
df_housing = pd.read_csv('housing.csv') # Data Frame

The `df.head()` method returns the first `n` rows of the object based on position (`n` defaults to 5). The `df.shape` attribute holds the number of rows and columns.

In [None]:
print("Total Row Number: {0} \nTotal Columns Number: {1}".format(df_housing.shape[0], df_housing.shape[1]))

The `df.info()` method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.
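As a minimal sketch of these inspection tools, the example below builds a tiny DataFrame from an in-memory CSV. The column names are borrowed from the housing dataset, but the values are invented and `df_toy` is a hypothetical name.

```python
import io
import pandas as pd

# Hypothetical miniature table; the values are invented for illustration
toy_csv = io.StringIO(
    "housing_median_age,median_income,ocean_proximity\n"
    "41,8.3,NEAR BAY\n"
    "21,,NEAR BAY\n"
    "52,7.2,INLAND\n"
    "30,5.6,NEAR OCEAN\n"
)
df_toy = pd.read_csv(toy_csv)

# First two rows (head() returns 5 rows when n is omitted)
first_rows = df_toy.head(2)
print(first_rows)

# shape is a (rows, columns) tuple
n_rows, n_cols = df_toy.shape
print(n_rows, n_cols)

# info() prints to standard output by default; buf= redirects it to a text buffer
info_buffer = io.StringIO()
df_toy.info(buf=info_buffer)
print(info_buffer.getvalue())
```

The empty field in the second data row is read as `NaN`, so `info()` reports one fewer non-null value for `median_income` than for the other columns.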

Calling the `df.hist()` method on the whole dataset plots a histogram for each numerical attribute.

In [None]:
df_housing.hist() # see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html

In [None]:
# Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
import matplotlib.pyplot as plt


Let's use `value_counts()` on one of the columns to get a Series containing the counts of its unique values.
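A quick sketch of `value_counts()` on a hypothetical column (the category labels mimic `ocean_proximity`, but the data is invented):

```python
import pandas as pd

# Hypothetical column of categories; the values are invented for illustration
proximity = pd.Series(
    ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND", "NEAR BAY"],
    name="ocean_proximity",
)

# Count how often each unique value occurs, most frequent first
counts = proximity.value_counts()
print(counts)

# normalize=True returns relative frequencies instead of counts
shares = proximity.value_counts(normalize=True)
print(shares)
```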

In [None]:
# Create a Boolean mask for rows where 'housing_median_age' is greater than 50


# Use the mask to filter the DataFrame, then count the occurrences of each unique value in 'ocean_proximity'


# Display the result



In [None]:
# Create Boolean masks for filtering the DataFrame
mask1 =
mask2 = 

# Apply both masks using the operator (&) and count unique values in 'ocean_proximity'
ocean_proximity_counts = 

# Display the result
print(ocean_proximity_counts)
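The following sketch shows how two Boolean masks can be combined with `&`. The two conditions and the income threshold are placeholders chosen for the demo, not necessarily the ones intended for the exercise above, and the data is invented.

```python
import pandas as pd

# Hypothetical rows; column names follow the housing dataset, values are invented
df_toy = pd.DataFrame({
    "housing_median_age": [52, 15, 51, 40, 52],
    "median_income": [2.1, 6.0, 5.5, 7.0, 1.5],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

# Each comparison yields a Boolean Series with one True/False per row
old_houses = df_toy["housing_median_age"] > 50
low_income = df_toy["median_income"] < 3.0  # arbitrary threshold for the demo

# & keeps only the rows where both masks are True
both = df_toy[old_houses & low_income]
print(both["ocean_proximity"].value_counts())
```

When the comparisons are written inline, each one needs its own parentheses, as in `df[(df["a"] > 50) & (df["b"] < 3)]`, because `&` binds more tightly than `>` and `<`.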


The pandas `DataFrame.groupby()` method splits the data into groups based on some criteria. pandas objects can be split on any of their axes, although grouping rows by the values of a column is by far the most common case.
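A minimal split-then-aggregate sketch on invented data (the name `df_toy` and the values are hypothetical):

```python
import pandas as pd

# Hypothetical rows; the values are invented for illustration
df_toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
    "median_house_value": [100000, 300000, 140000, 260000],
})

# Split the rows by category, select one column, then aggregate each group
mean_value = df_toy.groupby("ocean_proximity")["median_house_value"].mean()
print(mean_value)
```

The result is a Series indexed by the group labels, which is why it can be passed straight to a plotting method.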

In [None]:
grouped_df = 


In [None]:
# Plot the average median house value for each ocean proximity category as a bar chart


### Box Plot with Seaborn

We often evaluate a property by its location, so let's investigate house prices by location using a `box plot`.

A `box plot` shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.

For more information about box plots, see: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
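To make the outlier rule concrete, the sketch below computes the quartiles and the whisker limits by hand. It assumes the common 1.5 × IQR rule, which is the default for standard box plots in Matplotlib and Seaborn; the values are invented.

```python
import pandas as pd

# Hypothetical house values (in thousands); invented for illustration
values = pd.Series([100, 120, 130, 150, 160, 180, 200, 500])

# The box spans the first to the third quartile, with a line at the median
q1 = values.quantile(0.25)
median = values.quantile(0.50)
q3 = values.quantile(0.75)
iqr = q3 - q1  # inter-quartile range

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, iqr)
print(outliers.tolist())
```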

To draw the `box plot`, we will use `Seaborn`. `Seaborn` is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

For more information about `Seaborn`, see: https://seaborn.pydata.org/

In [None]:
import seaborn as sns

# Set the figure size for better readability of the plot


# Create a boxen plot to show the distribution of median house values across ocean proximity categories


# Add a title to the plot


# Display the plot



### Looking for Correlations

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation; for example, the median house value tends to go up when the median income goes up. When the coefficient is close to –1, it means that there is a strong negative correlation; you can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north). Finally, coefficients close to zero mean that there is no linear correlation.
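The three cases can be reproduced on a toy table. This is a minimal sketch with invented columns: one rises with `x`, one falls as `x` rises, and one depends on `x` but not linearly.

```python
import pandas as pd

# Hypothetical columns constructed to hit the three cases exactly
df_toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],      # rises in step with x
    "down": [10, 8, 6, 4, 2],    # falls as x rises
    "curve": [4, 1, 0, 1, 4],    # (x - 3)**2: related to x, but not linearly
})

# Pairwise Pearson correlation between all columns
corr_toy = df_toy.corr()
print(corr_toy["x"])
```

The `curve` column is fully determined by `x`, yet its coefficient is zero. A coefficient near zero rules out a linear relationship only, not every relationship.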

<img src="images/correlation.png" style="width:700px;height:200px;">

In [None]:
import numpy as np 

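# Compute pairwise correlation of all numerical columns, excluding NA/null values
# numeric_only=True skips text columns such as 'ocean_proximity' (pandas 2.0+ raises an error otherwise)
corr = df_housing.corr(numeric_only=True)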

# Set up the figure size for better readability
plt.subplots(1, 1, figsize=(15, 10))

# Generate a mask for the upper triangle to avoid redundant information
mask = np.triu(np.ones_like(corr, dtype=bool))

# Create a heatmap of the correlation matrix with annotations and styling
sns.heatmap(corr, 
            annot=True,        # Show correlation coefficients
            mask=mask,         # Apply the upper triangle mask
            linewidths=1,      # Add lines between cells
            cmap="YlGnBu")     # Use a visually appealing color map

# Display the plot
plt.show()


### Feature Engineering

`Feature engineering` is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms.

**Let's divide the data into homogeneous subgroups.** The following code creates an income category attribute in three steps:
- Divide the median income by 3 (to limit the number of income categories).
- Round up using `ceil` (to have discrete categories).
- Merge all the categories greater than 5 into category 5.

**For this task it will be important to know DataFrame.where, so let's take a look at the syntax.**

```
Syntax:
DataFrame.where(cond, other=nan, inplace=False) 
 
Parameters:
cond: Boolean condition(s) evaluated element-wise; entries where cond is True are kept.
other: Value used to replace entries where cond is False. Default is NaN.
inplace: Boolean. If True, modifies the DataFrame itself instead of returning a new one.
```
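A minimal sketch of how `where` can cap a column, using invented incomes. The divisor 3 and the cap of 5 are assumptions for this demo, and `df_toy` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; invented for illustration
df_toy = pd.DataFrame({"median_income": [1.2, 3.0, 4.5, 9.8, 15.0001]})

# Divide by 3 and round up to get discrete categories
df_toy["income_categorical"] = np.ceil(df_toy["median_income"] / 3)

# where() keeps entries where the condition is True and replaces the rest with 5.0
df_toy["income_categorical"] = df_toy["income_categorical"].where(
    df_toy["income_categorical"] < 5, 5.0
)

print(df_toy)
```

The last income falls just above 15, so `ceil` places it in category 6 before `where` caps it at 5.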

In [None]:
# Create a new column 'income_categorical' by dividing 'median_income' by 3 and applying the ceiling function
# This groups income into categories (e.g., 0–3, 3–6, etc.). Max median_income is 15, so 15/3 = 5


# Cap the income categories at 5.0 to avoid having values greater than 5


# Display the first few rows of the updated DataFrame



In [None]:
# How income influences the House median value.
import matplotlib.pyplot as plt

# Group the data by income category and calculate the mean of median house values
# Then plot the result as a bar chart
df_housing.groupby('income_categorical')['median_house_value'].mean().plot(kind='bar')


# Add a title to the plot for context


# Display the plot



### Missing Values in Data

It is important to understand the concept of missing values in order to manage data successfully. If missing values are not handled properly, the researcher may draw inaccurate inferences from the data.
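One common way to see how much data is missing is to count the null entries per column. The sketch below does this on an invented table; the names `df_toy` and `summary` are hypothetical, and this is only one possible layout for such a report.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a few gaps; the values are invented for illustration
df_toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129, np.nan, 190, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", None, "INLAND"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column
missing_count = df_toy.isnull().sum()
missing_percent = 100 * missing_count / len(df_toy)

# Collect both measures in one small summary table
summary = pd.DataFrame({"missing": missing_count, "percent": missing_percent})
print(summary)
```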

<img src="images/missing.png" style="width:550px;height:300px;">

In [None]:
# Create a DataFrame that reports the number of missing values in each column
df_missing = 
df_missing