# 10. Data analysis

It is time to combine everything we have learned so far and carry out a data analysis, an essential task in data science.

Here is the table of contents for this notebook:

- 10.1 Describing the dataset
- 10.2 Calculating the mean of the value column
- 10.3 Improving the implementation
- 10.4 Exercises

## 10.1 Describing the dataset

Let's `open` `data.csv` and `print` all of its lines:

In [3]:
fhand = open('data.csv')
for line in fhand:
    print(line)
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

This dataset is part of the Sustainable Development Goals (SDG). There are 4 columns. The first column is `indicator`. This indicator is part of **Goal 7**:

>Ensure access to affordable, reliable, sustainable and modern energy for all

and **indicator 7.1.2** is the:

>Proportion of population with primary reliance on clean fuels and technologies for cooking (%)

The second column is `geoareaname`, which is short for Geographical Area Name. The `timeperiod` column shows when the indicator was measured, and the `value` column holds the value of the indicator as a percentage. Values higher than 95% are shown as `>95`, and values lower than 5% are shown as `<5`. For our analysis we will treat `>95` as 95% and `<5` as 5%.

## 10.2 Calculating the mean of the value column

Let's find the mean of the `value` column.

But all we have is the whole line. How can we access the value? Let's look at the second line (the first line holds the column names) to figure out how we can extract it.

In [2]:
fhand = open('data.csv')
count = 0
for line in fhand:
    print(line)
    count = count + 1
    if count == 2:
        break
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

How can we extract the number 52 from this line?

In [1]:
line

NameError: name 'line' is not defined

We know that the values are at the end so perhaps we can slice the line.

**Exercise 10.1**

Slice the line to get `52` from `line`

In [4]:
# YOUR CODE HERE

Hopefully you managed to do it. But we want an approach that works for _all_ lines. So let's _test_ it for all lines:

In [5]:
fhand = open('data.csv')
for line in fhand:
    print(line[-3:-1]) # Exercise 10.1
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

It seems like we have a problem 😱 Our ingenious approach, which worked for a single case, did not generalize to all cases 😢

It worked for most cases, but that is not enough. Let's see the cases where it fails:

- Case 1: First line is column names, so it does not contain a value
- Case 2: Single digit numbers (`;7`, `<5`)

When we designed our algorithm, our example line had a two-digit number. But as we have just seen, that is not always the case. You will encounter this over and over again: when you are programming, you have to consider as many of the possible cases as you can.
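
To make the failing cases concrete, here is a small sketch that applies the same slice to a handful of sample lines. Only the Bhutan line comes from this notebook. The header line is assumed from the column names described above, and the `Exampleland` lines are made up for illustration.

```python
# Sample lines in the same format as data.csv.
sample_lines = [
    'indicator;geoareaname;timeperiod;value\n',  # header (assumed from the column names)
    '7.1.2;Bhutan;2015;52\n',                    # line used in this notebook
    '7.1.2;Exampleland;2015;7\n',                # made-up single-digit value
    '7.1.2;Exampleland;2016;<5\n',               # made-up value below 5%
    '7.1.2;Exampleland;2017;>95\n',              # made-up value above 95%
]

sliced = []
for line in sample_lines:
    piece = line[-3:-1]  # the two characters before the newline
    sliced.append(piece)
    print(repr(line), '->', repr(piece))
```

Only the lines whose last two characters before the newline are digits give a string that `int` can convert. The header, the single-digit value and the `<5` value all produce something else.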

Considering all cases when writing a program in Python is essential for producing reliable, efficient, and bug-free software. By covering a wide range of scenarios, including normal, boundary, and _edge cases_, developers can enhance the program's functionality, handle unexpected situations gracefully, and provide a better user experience. Thorough testing and validation with various input values help identify and resolve potential issues, ensuring that the program performs as expected in any given situation.

It is hard to foresee all the cases, therefore _testing_ your code is key. In fact, that is how we discovered the error in our code above.

- We developed an approach for a single case.
- We tested it on all lines.
- We detected failures in some cases.

### 🐍 Advanced 🐍

Checking a small piece of code like this is the idea behind _unit testing_. It is called a _unit_ test because it targets a small piece of the source code (here, extracting a value from a line). You can test your code at many different levels, from testing units to testing whole systems. We are not going to cover unit testing in this block. If you would like to learn more, take a look at the following tutorial:

[Python Tutorial: Unit Testing Your Code with the unittest Module](https://www.youtube.com/watch?v=6tNS--WetLI)

Now, let's continue and come up with an approach that would work for all cases.

We can handle the first case by skipping the first line, as follows:

In [6]:
fhand = open('data.csv')
count = 0
for line in fhand:
    # Skip the first line
    if count == 0:
        count += 1
        continue

    print(line[-3:-1])
    count = count + 1

fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

For the second case, we can check whether the sliced string can be converted to an integer and catch the cases where it cannot with `try`/`except`:

In [7]:
int(';7')

ValueError: invalid literal for int() with base 10: ';7'

In [8]:
try:
    int(';7')
    print('Double digit')
except:
    print('Single digit')

Single digit
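
A bare `except:` catches every kind of exception, not only the failed conversion. As a variation, the sketch below names `ValueError` explicitly, which is the exception `int` raised above. The function name `classify` is made up for this example.

```python
def classify(text):
    """Hypothetical helper: report how the two-character slice should be read."""
    try:
        int(text)
        return 'Double digit'
    except ValueError:
        # Only a failed conversion ends up here; other errors are not hidden.
        return 'Single digit'


print(classify('52'))
print(classify(';7'))
```

With the narrower `except ValueError:`, an unrelated mistake such as a misspelled variable name would still surface as an error instead of being silently treated as a single-digit number.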


Let's add this logic into our program.

In [9]:
fhand = open('data.csv')
count = 0
for line in fhand:
    # Skip the first line
    if count == 0:
        count += 1
        continue
    
    # Slice the number
    number = line[-3:-1]
    
    try:
        # For two digit numbers this should work
        # It will raise an exception for single digit numbers, except block will run
        number = int(number)
        print(number)
    except:
        # If this code block is executed
        # That means we have a single digit number
        # We can slice accordingly
        number = line[-2:-1]
        number = int(number)
        print(number)
    
    count = count + 1

fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

It worked 🥳 To calculate the mean of the column, we need to save these numbers into a data structure (e.g. a list) and calculate their mean.
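
The mean calculation itself can be tried on a small list first. The values below are made up for illustration. The sketch also shows `statistics.mean` from the standard library, which should give the same result as dividing the sum by the length.

```python
from statistics import mean

numbers = [52, 7, 95, 5]  # made-up values, not taken from data.csv

manual_mean = sum(numbers) / len(numbers)  # sum of the values divided by their count
library_mean = mean(numbers)               # same statistic from the standard library

print(manual_mean)
print(library_mean)
```

Note that `sum(numbers) / len(numbers)` raises a `ZeroDivisionError` when the list is empty, so it is worth making sure that at least one value was collected.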

In [10]:
fhand = open('data.csv')
numbers = [] # create an empty list
count = 0
for line in fhand:
    # Skip the first line
    if count == 0:
        count += 1
        continue
    
    # Slice the number
    number = line[-3:-1]
    
    try:
        # For two digit numbers this should work
        # It will raise an exception for single digit numbers, except block will run
        number = int(number)
        numbers.append(number) # append to list instead of printing
    except:
        # If this code block is executed
        # That means we have a single digit number
        # We can slice accordingly
        number = line[-2:-1]
        number = int(number)
        numbers.append(number) # append to list instead of printing
print(sum(numbers)/len(numbers))
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

That's it! 

The average proportion of the population with primary reliance on clean fuels and technologies for cooking is 63.5%.

## 10.3 Improving the implementation

Let's remember the Zen of Python:

In [None]:
import this

Let's see if we can make our code more _Pythonic_.

>Exploiting the features of the Python language to produce code that is clear, concise and maintainable. Pythonic means code that doesn't just get the syntax right, but that follows the conventions of the Python community and uses the language in the way it is intended to be used. [Source](https://stackoverflow.com/a/25011492).

Let's look at the first data line again.

In [11]:
fhand = open('data.csv')
count = 0
for line in fhand:
    print(line)
    count = count + 1
    if count == 2:
        break
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

In [12]:
line

NameError: name 'line' is not defined

Whenever you are working with strings, try to remember and use the string methods. We have already learned about the `strip` and `split` methods.

**Exercise 10.2**

Use the `strip` method to get rid of the newline character `\n`.

In [13]:
line = '7.1.2;Bhutan;2015;52\n'
# YOUR CODE HERE

**Exercise 10.3**

Use the `split` method to split the remaining string. Use the delimiter `;`.

The result should be
`['7.1.2', 'Bhutan', '2015', '52']`


In [14]:
line = '7.1.2;Bhutan;2015;52' # line after exercise 10.2
# YOUR CODE HERE

**Exercise 10.4**

Access the value we are looking for (i.e. 52) from the list `['7.1.2', 'Bhutan', '2015', '52']`

In [15]:
line = ['7.1.2', 'Bhutan', '2015', '52']
# YOUR CODE HERE

This might look more complex than slicing the string, but this approach covers more cases. Slicing assumed a two-digit number in a fixed location, whereas this code works regardless of the number of digits. Additionally, a slice with hardcoded indices (e.g. `line[-3:-1]`) breaks easily when lines differ slightly from one another.

In [16]:
fhand = open('data.csv')
for line in fhand:
    line = line.strip() # Exercise 10.2
    line = line.split(';') # Exercise 10.3
    number = line[-1] # Exercise 10.4
    print(number)
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

We still have the same problem with the first line. Additionally, we have to get rid of the `<` and `>` symbols. We can use the `replace` method for that.

**Exercise 10.5**

Use the `replace` method to remove the greater-than sign (`>`).

In [17]:
number = '>95'
# YOUR CODE HERE

**Exercise 10.6**

Use the `replace` method to remove the less-than sign (`<`).

In [18]:
number = '<5'
# YOUR CODE HERE

Note that `replace` leaves the string unchanged if the character to be replaced does not occur in it. So we can safely apply both replacements to all numbers.
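
A quick check of this behaviour on three sample strings, as a minimal sketch (the plain value `'52'` is the one from the Bhutan line):

```python
raw_values = ['52', '>95', '<5']

cleaned_values = []
for raw in raw_values:
    # replace returns a new string; if the character is absent, the text is unchanged
    cleaned = raw.replace('>', '').replace('<', '')
    cleaned_values.append(cleaned)
    print(repr(raw), '->', repr(cleaned))
```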

In [19]:
fhand = open('data.csv')
for line in fhand:
    line = line.strip() # Exercise 10.2
    line = line.split(';') # Exercise 10.3
    number = line[-1] # Exercise 10.4
    number = number.replace('>', '') # Exercise 10.5
    number = number.replace('<', '') # Exercise 10.6
    print(number)
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

We can make the code even more concise by chaining methods:

In [20]:
fhand = open('data.csv')
for line in fhand:
    number = line.strip().split(';')[-1] # Exercises 10.2-4
    number = number.replace('>', '').replace('<', '') # Exercises 10.5-6
    print(number)
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

Let's also skip the first line, this time by checking its content instead of counting:

In [21]:
fhand = open('data.csv')
for line in fhand:
    number = line.strip().split(';')[-1] # Exercises 10.2-4
    if number == 'value':  # new logic for skipping the first line
        continue
    number = number.replace('>', '').replace('<', '') # Exercises 10.5-6
    print(number)
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

That's it! Finally, we collect the numbers in a list and calculate the average.

In [22]:
fhand = open('data.csv')
numbers = []
for line in fhand:
    number = line.strip().split(';')[-1] # Exercises 10.2-4
    if number == 'value':  # new logic for skipping the first line
        continue
    number = number.replace('>', '').replace('<', '') # Exercises 10.5-6
    numbers.append(int(number))
print(sum(numbers)/len(numbers))
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

As you can see, this code is much shorter and easier to read than the previous version:

```python
fhand = open('data.csv')
numbers = []
count = 0
for line in fhand:
    if count == 0:
        count += 1
        continue
    number = line[-3:-1]
    try:
        number = int(number)
        numbers.append(number)
    except:
        number = line[-2:-1]
        number = int(number)
        numbers.append(number)
print(sum(numbers)/len(numbers))
fhand.close()
```

This is nice but did you notice anything else? The previous result was 63.54% but with the second implementation we got 63.94% 😱😱😱😱

This is a major problem: we only changed how we extract the values, which should not change the average. Discovering the cause is left to you as an exercise.

## 10.4 Exercises

**Exercise 10.7**

Investigate why these two implementations, which are intended to calculate the same statistic, give different results. Determine which implementation is correct and which one is wrong.
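
One possible way to start such an investigation is to compare the two lists of numbers position by position. The sketch below uses two made-up lists, `numbers_a` and `numbers_b`, as stand-ins for the lists built by the two implementations. It does not use the real data and is only meant to show the comparison technique.

```python
# Made-up stand-ins for the lists produced by two implementations.
numbers_a = [52, 7, 95, 5, 30]
numbers_b = [52, 7, 95, 5, 31]

# zip stops at the shorter list, so compare the lengths separately.
print(len(numbers_a), len(numbers_b))

mismatches = []
for index, (a, b) in enumerate(zip(numbers_a, numbers_b)):
    if a != b:
        # Remember the position and both values for later inspection.
        mismatches.append((index, a, b))

print(mismatches)
```

Each entry in `mismatches` points to a position where the two lists disagree, which tells you which line of the file to look at more closely.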

In [23]:
# Implementation 1
fhand = open('data.csv')
numbers = []
count = 0
for line in fhand:
    if count == 0:
        count += 1
        continue
    number = line[-3:-1]
    try:
        number = int(number)
        numbers.append(number)
    except:
        number = line[-2:-1]
        number = int(number)
        numbers.append(number)
print(sum(numbers)/len(numbers))
fhand.close()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

In [None]:
# Implementation 2
fhand = open('data.csv')
numbers = []
for line in fhand:
    number = line.strip().split(';')[-1] # Exercises 10.2-4
    if number == 'value':  # new logic for skipping the first line
        continue
    number = number.replace('>', '').replace('<', '') # Exercises 10.5-6
    numbers.append(int(number))
print(sum(numbers)/len(numbers))
fhand.close()

**Exercise 10.8**

Fix the wrong implementation.

In [None]:
fhand = open('data.csv')
numbers = []
for line in fhand:
    number = line.strip().split(';')[-1]
    if number == 'value':
        continue
    number = number.replace('>', '').replace('<', '')
    numbers.append(int(number))
print(sum(numbers)/len(numbers))
fhand.close()


**Exercise 10.9**

Find the geographical areas where the proportion of the population with primary reliance on clean fuels and technologies for cooking is less than or equal to 5%.

Expected result:

`['Rwanda',
 'Malawi',
 'Nigeria',
 'Guinea',
 'Niger',
 'Myanmar',
 'Sierra Leone',
 'South Sudan',
 'Madagascar',
 'Chad']`