# Loading data from a csv file, and pre-processing data for further analysis

<br>

<div style="text-align: justify">In this notebook, we will see how to perform simple and essential tasks when starting data analysis with Python. A procedural approach which is more comprehensive for beginners is used, without using the <a href="https://pandas.pydata.org/">Pandas</a> and <a href="https://www.scipy.org/">SciPy</a> packages that are regularly used for data analysis and data scraping. This choice is done for better understanding of the different functions that are already implemented in these libraries, and to comprehend from scratch the programming behind. The different lines of code are well commented for the reader to understand. The first step in the current notebook is to import the different built-in modules in Python that are required to compute our code. </div> 

In [4]:
import csv # csv for reading the csv files
from math import sqrt # square root function from math module

<div style="text-align: justify">The data could be obtained by <a href="https://en.wikipedia.org/wiki/Data_scraping">data scrapping</a> from a website for instance or via an <a href="https://en.wikipedia.org/wiki/Application_programming_interface">Application Programming Interface (API)</a> . 

Here instead, we consider loading the data from a CSV file as it is one the most common data-exchange format. A <a href="https://en.wikipedia.org/wiki/Comma-separated_values">comma separated values</a> (CSV) file contains different values separated by a delimiter, which acts as a database table or an intermediate form of a database table. In other words, a CSV file file is a set of database rows and columns stored in a text file such that the rows are separated by a new line while the columns are separated by a semicolon or a comma. A CSV file is primarily used to transport data between two databases of different formats through a computer program.

The function <i>load_csv()</i> below, scrapes the data from a CSV file by reading the different rows of the file, and by storing the rows one by one in a list.</div>

In [10]:
## Load a CSV file
def load_csv(filename):
    dataset = list()  # creates a list where the row will be stored
    with open(filename, 'r') as file: # open the file in reading mode 
        csv_reader = csv.reader(file) # csv.reader built-in function
        for row in csv_reader: # loop on the rows of the file
            if not row:
                continue
            dataset.append(row) # adding row to the dataset
    return dataset

<div style="text-align: justify">The data acquired from the function above can be of different types. The data can be made of strings of characters or floating numbers for instance. When reading from a CSV file, the data is parsed as a string of character. However, in order to perform computational operations and analysis on the data, this requires that all string of numbers are transformed into floating numbers.</div>   
<br>

<div style="text-align: justify">The function <i>str_column_to_float()</i> below implements this operation for a column of the dataset. The function can then later be looped on the desired columns to turn strings into floating numbers. 
</div>

In [11]:
## Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip()) # strip() function strips all characters from the beginning and the end of the string (default whitespace characters)

<div style="text-align: justify">The function <i>str_column_to_float()</i> will usually be used to transform all the features of the dataset into floating numbers to perform different operations on them. However, when considering dataset for classification especially, different classes may be a string of characters as well. For instance, let's consider that in a group of person, these people are classified by age and we have 2 classes 'Below 30' and 'Above 30'. Operations won't be performed on these classes as they will just be used for comparison with predictions, for instance in a machine learning classification problem.</div>

<br>

<div style="text-align: justify">However, transforming the data from this column into integers, can ease the classification process. For example, the string 'Below 30' is replaced by the number 0 and 'Above 30' by 1. In this way, predictions from the features are easier to compare with the classification values.</div>

<br>

<div style="text-align: justify">The function <i>str_column_to_int()</i> below allows this conversion operation for a particular column. First, a list ('class_values') containing the different rows of the column is created. From this list, 'unique' is created which is a set looking at all unique values of the list 'class_values'. A dictionary ('lookup') is then initialised to match integer values to the string characters. The string characters can now be referenced with their integer values, which will be usually preferred for data analysis.</div>


In [12]:
## Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset] # comprehension list with data of the considered column stored
    unique = set(class_values) # set where all unique values are stored
    lookup = dict() #
    for i, value in enumerate(unique): # enumerate function (loop + automatic counter of values)
        lookup[value] = i # value is the key of dictionary (the string characters) and i is the number affiliated
    for row in dataset:  # modifying column in dataset
        row[column] = lookup[row[column]]
    return lookup

In [13]:
# Load iris dataset
filename = 'iris.csv'
dataset = load_csv(filename)

print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
print('First row of the dataset: ', dataset[0]) # print first line of the dataset
print('--------------------------------------')

# convert string columns to float 
for i in range(len(dataset[0])-1): # loop on all columns
    str_column_to_float(dataset, i)
# convert class column to int
lookup = str_column_to_int(dataset, 4)

print('First row of modified dataset: ', dataset[0]) # print first line of updated dataset
print(lookup) # print lookup dictionary containing the classification of the species and their corresponding number

Loaded data file iris.csv with 150 rows and 5 columns
First row of the dataset:  ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
--------------------------------------
First row of modified dataset:  [5.1, 3.5, 1.4, 0.2, 2]
{'Iris-virginica': 0, 'Iris-versicolor': 1, 'Iris-setosa': 2}


In [44]:
##### Normalize Data ###########

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        colvalues = [row[i] for row in dataset]
        min_value = min(colvalues) 
        max_value = max(colvalues)
        minmax.append([min_value, max_value])
    return minmax

# Normalize the dataset except last row for classification values
def Normalize_Dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)-1):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

In [45]:
# Calculate min and max for each column
minmax = dataset_minmax(dataset)
# Normalize columns
Normalize_Dataset(dataset, minmax)
print('First row of normalized dataset: ', dataset[0])

First row of normalized dataset:  [0.22222222222222202, 0.6249999999999999, 0.06779661016949144, 0.041666666666666644, -1.2206555615733707]


In [55]:
#### Standardize Data ######

# Calculate column means
def column_means(dataset):
    means = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means[i] = sum(col_values) / float(len(dataset))
    return means

# Calculate column standard deviations
def column_stdevs(dataset, means):
    stdevs = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        variance = [pow(row[i]-means[i], 2) for row in dataset]
        stdevs[i] = sum(variance)
        stdevs = [sqrt(x/(float(len(dataset)-1))) for x in stdevs]
    return stdevs

# Standardize the dataset
def Standardize_Dataset(dataset, means, stdevs):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]

In [75]:
# Estimate mean and standard deviation
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
# standardize dataset
Standardize_Dataset(dataset, means, stdevs)
print(dataset[0])

[-9.179297646015436e+22, 98315463418529.08, -4160058.767277357, -194.9783686428172, -1.2206555615733707]
