# Assignment 3

These assignments cover the **pandas** library.
You will write some new code, and the assignments also use the code that is already provided.
* Before doing the assignments, read the related course material:
[Topic 3: Pandas Statistics and Grouping](https://ttc8040.pages.labranet.jamk.fi/da_vi_material/lectures/topic3_pandas_stats_grouping.nbconvert/).

General notes on the assignments:
* NOTE! In general, each assignment includes a test program that follows the function implementation.
* NOTE! The test program and the correct answer values are already implemented, so please don't edit them.
* NOTE! Add your code only after the TODO lines.

## Assignment 03-01. Handling NaN values (1p)

The goal of this assignment is to handle NaN (Not a Number) values within a `DataFrame`.

In this assignment, you will read data from a CSV file into a `DataFrame`.
The implementation goes in the `read_last_rows()` function.

* Read the CSV file whose name is given in the test program variable `url_src`.
* Read all columns from the given file.
* Set column names in the following order: `"Sepal length", "Sepal width", "Petal length", "Petal width", "Species"`
* Convert all numeric columns to the appropriate numeric format.
* Convert all non-numeric values in the numeric columns to `NaN`.
* Keep all rows that have at most **two** `NaN` columns. In other words, filter out all rows that have at least three `NaN` values. 
* Return the last five (5) rows of the `DataFrame`.
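
The coercion and row-filtering steps above can be sketched on a small toy table. This is only an illustration under stated assumptions: the column names `a`, `b`, `c`, `label` and the placeholder strings are hypothetical and do not come from the iris file. It also shows `dropna(thresh=...)` as an alternative way to express "at most two `NaN` values per row".

```python
import pandas as pd

# Hypothetical toy data: "?", "n/a", "oops" etc. stand in for non-numeric entries.
toy = pd.DataFrame({
    "a": ["1.0", "?", "3.0", "x"],
    "b": ["2.0", "n/a", "oops", "y"],
    "c": ["0.5", "?", "1.5", "z"],
    "label": ["s", "s", "t", "t"],
})

# Coerce the numeric columns; values that cannot be parsed become NaN.
for col in ["a", "b", "c"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Count NaN values per row and keep rows with at most two of them.
nan_per_row = toy.isna().sum(axis=1)
kept = toy[nan_per_row <= 2]

# Alternative: thresh is the minimum number of non-NaN values a row must have,
# so "at most two NaN" means "at least (number of columns - 2) non-NaN".
kept_thresh = toy.dropna(thresh=toy.shape[1] - 2)
```

Both variants keep the same rows here. The boolean-mask form stays closer to the wording of the task, while `thresh` is shorter once the column count is known.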


In [1]:
import pandas as pd
import numpy as np

correct_03_01 = pd.DataFrame({'Sepal length': {145: 6.7, 146: 6.3, 147: 6.5, 149: np.nan, 151: 5.9},
                              'Sepal width': {145: np.nan, 146: 2.5, 147: 3.0, 149: 3.0, 151: 3.0},
                              'Petal length': {145: np.nan, 146: 5.0, 147: 5.2, 149: 5.1, 151: np.nan},
                              'Petal width': {145: 2.3, 146: 1.9, 147: 2.0, 149: np.nan, 151: np.nan},
                              'Species': {145: 'Iris-virginica', 146: 'Iris-virginica', 147: 'Iris-virginica',
                                          149: 'Iris-virginica', 151: 'Iris-virginica'}})


def read_last_rows(url_src, n_last):
    # Reading CSV file into DataFrame
    df = pd.read_csv(url_src, header=None)
    
    # Setting column names
    df.columns = ["Sepal length", "Sepal width", "Petal length", "Petal width", "Species"]
    
    # Converting numeric columns to appropriate numeric format and non-numeric values to NaN
    for col in df.columns[:-1]:  # Exclude the last column (Species)
        df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # Keeping rows with at most two NaN values
    df = df[df.isnull().sum(axis=1) <= 2]
    
    # Returning the last n_last rows
    return df.tail(n_last)


# The Test Program includes automatic checking of the answer. Don't Edit it!
url_src = "data/iris_1.csv"
res = read_last_rows(url_src, 5)
print(res)

try:
    print(res.to_string())
    pd.testing.assert_frame_equal(res, correct_03_01, check_dtype=True)
    print(f'Result was OK')
except AssertionError as err_msg:
    print(err_msg)

     Sepal length  Sepal width  Petal length  Petal width         Species
145           6.7          NaN           NaN          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
149           NaN          3.0           5.1          NaN  Iris-virginica
151           5.9          3.0           NaN          NaN  Iris-virginica
     Sepal length  Sepal width  Petal length  Petal width         Species
145           6.7          NaN           NaN          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
149           NaN          3.0           5.1          NaN  Iris-virginica
151           5.9          3.0           NaN          NaN  Iris-virginica
Result was OK


## Assignment 03-02. Calculating values (1p)

The primary goal of this assignment is to read iris data from a CSV file, process the data, and run specific analyses.
Count the irises that match the given petal width or petal length filters, and express each count as a percentage of all iris flowers in the data, across all species.

In this assignment, you will read data from a CSV file into a `DataFrame`.
The implementation goes in the `iris_count_rows()` function.

* Read the CSV file whose name is given in the test program variable `url_src`.
* Read all columns from the given file.
* Set column names in the following order: `"Sepal length", "Sepal width", "Petal length", "Petal width", "Species"`.
* Convert all numeric columns to the appropriate numeric format.
* Convert all non-numeric values in the numeric columns to `NaN`, then replace all `NaN` values with zeroes without removing the rows.
* Count how many irises you find with a `Petal width` less than or equal to `0.2` and greater than `0.0`.
* Count how many irises you find where the `Petal length` is greater than or equal to `5.0` but less than or equal to `5.2`.
* Then calculate each count's share of all iris flowers, as a percentage of the total number of rows.
* Build a `Series` with the index labels `['found petal width', 'found petal length', 'found petal width %', 'found petal length %']` and store the calculated values under them.
* Return the resulting `Series`.
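
The counting and percentage steps above can be sketched on toy values. This is a sketch only: the numbers below are hypothetical and are not taken from the iris data, and the names `widths`, `lengths` and `summary` are illustrative.

```python
import pandas as pd

# Hypothetical toy values, not taken from the assignment data.
widths = pd.Series([0.0, 0.1, 0.2, 0.3, 0.2, 1.5, 0.0, 0.4])
lengths = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3])

# A comparison yields a boolean Series; summing it counts the True values.
width_mask = (widths > 0.0) & (widths <= 0.2)
found_width = int(width_mask.sum())

# Series.between includes both endpoints by default.
found_length = int(lengths.between(5.0, 5.2).sum())

# Share of all rows as a percentage, rounded to two decimals.
share_width = round(found_width / len(widths) * 100, 2)

# Mixing integer counts and float percentages in one Series upcasts to float64.
summary = pd.Series({"found": found_width, "found %": share_width})
```

The upcast in the last step is why integer counts print as `34.00` rather than `34` when they share a `Series` with percentage values.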

In [2]:
import pandas as pd
import numpy as np

correct_03_02 = pd.Series(
    {'found petal width': 34, 'found petal length': 13, 'found petal width %': 22.37, 'found petal length %': 8.55}
)


def iris_count_rows(url_src):
    # Reading CSV file into DataFrame
    df = pd.read_csv(url_src, header=None)
    
    # Setting column names
    df.columns = ["Sepal length", "Sepal width", "Petal length", "Petal width", "Species"]
    
    # Converting numeric columns to appropriate numeric format and non-numeric values to NaN, then convert NaNs to zeros
    for col in df.columns[:-1]:  # Exclude the last column (Species)
        df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
    
    # Counting based on conditions
    count_petal_width = df[(df['Petal width'] > 0) & (df['Petal width'] <= 0.2)].shape[0]
    count_petal_length = df[(df['Petal length'] >= 5) & (df['Petal length'] <= 5.2)].shape[0]
    
    # Calculating shares
    total_flowers = df.shape[0]
    share_petal_width = (count_petal_width / total_flowers) * 100
    share_petal_length = (count_petal_length / total_flowers) * 100
    
    # Creating Series with specified indexes and calculated values
    results = pd.Series({
        'found petal width': count_petal_width,
        'found petal length': count_petal_length,
        'found petal width %': round(share_petal_width, 2),
        'found petal length %': round(share_petal_length, 2)
    })
    
    return results


# The Test Program includes automatic checking of the answer. Don't Edit it!
url_src = "data/iris_1.csv"
res = iris_count_rows(url_src)
print(res)

try:
    print(res.to_string())
    pd.testing.assert_series_equal(res, correct_03_02, check_dtype=True)
    print(f'Result was OK')
except AssertionError as err_msg:
    print(err_msg)

found petal width       34.00
found petal length      13.00
found petal width %     22.37
found petal length %     8.55
dtype: float64
found petal width       34.00
found petal length      13.00
found petal width %     22.37
found petal length %     8.55
Result was OK


## Assignment 03-03. Grouping and Multi-indexes (1p)

The primary goal of this assignment is to read iris data from a CSV file, process the data, and calculate statistical values for specific columns.
The task involves performing group-based calculations and structuring the results in a _Multi-index_ `DataFrame`.

In this assignment, you will read data from a CSV file into a `DataFrame`.
The implementation goes in the `calculate_stats_for_groups()` function.

* Read the CSV file whose name is given in the test program variable `url_src`.
* Read all columns from the given file.
* Set column names in the following order: `"Sepal length", "Sepal width", "Petal length", "Petal width", "Species"`.
* Convert all numeric columns to the appropriate numeric format.
* Convert all non-numeric values in the numeric columns to `NaN`.
* Filter out all rows that have at least one `NaN` value.
* For each iris class separately, calculate the statistical values `(number of items, average, median)` for the `'Sepal length'` and `'Sepal width'` columns.
* Return the results as a Multi-index `DataFrame`.
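
As a minimal sketch of the grouping step, the example below aggregates a toy table with `groupby(...).agg(...)`. The column names `group`, `x` and `y` and all values are hypothetical. Passing a dict of lists to `agg` yields columns with two levels (source column, statistic).

```python
import pandas as pd

# Hypothetical toy data standing in for the species and sepal columns.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "x": [1.0, 3.0, 2.0, 4.0, 9.0],
    "y": [10.0, 20.0, 30.0, 30.0, 60.0],
})

# One list of statistics per column; the result has MultiIndex columns.
stats = toy.groupby("group").agg({
    "x": ["count", "mean", "median"],
    "y": ["count", "mean", "median"],
})

# A single statistic is selected with a (column, statistic) tuple.
x_mean = stats[("x", "mean")]
```

The grouping key becomes the row index, and its name (`group` here) is carried over as the index name.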

In [5]:
import pandas as pd

correct_03_03 = pd.DataFrame(
    {('Sepal length', 'count'): {'Iris-setosa': 50, 'Iris-versicolor': 50, 'Iris-virginica': 43},
     ('Sepal length', 'mean'): {'Iris-setosa': 5.006, 'Iris-versicolor': 5.936, 'Iris-virginica': 6.618604651162792},
     ('Sepal length', 'median'): {'Iris-setosa': 5.0, 'Iris-versicolor': 5.9, 'Iris-virginica': 6.5},
     ('Sepal width', 'count'): {'Iris-setosa': 50, 'Iris-versicolor': 50, 'Iris-virginica': 43},
     ('Sepal width', 'mean'): {'Iris-setosa': 3.418, 'Iris-versicolor': 2.77, 'Iris-virginica': 2.953488372093023},
     ('Sepal width', 'median'): {'Iris-setosa': 3.4, 'Iris-versicolor': 2.8, 'Iris-virginica': 3.0}})
correct_03_03.index.name = "Species"


def calculate_stats_for_groups(url_src):
    # Reading CSV file into DataFrame
    df = pd.read_csv(url_src, header=None)
    
    # Setting column names
    df.columns = ["Sepal length", "Sepal width", "Petal length", "Petal width", "Species"]
    
    # Converting numeric columns to appropriate numeric format and non-numeric values to NaN
    for col in df.columns[:-1]:  # Exclude the last column (Species)
        df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # Filter out rows with any NaN values
    df = df.dropna()
    
    # Grouping by 'Species' to calculate stats for 'Sepal length' and 'Sepal width'
    grouped = df.groupby('Species')
    
    # Calculating the required statistics
    stats = grouped.agg({
        'Sepal length': ['count', 'mean', 'median'],
        'Sepal width': ['count', 'mean', 'median']
    })
    
    # Renaming the columns to match the required output format
    stats.columns = pd.MultiIndex.from_tuples([
        ('Sepal length', 'count'), 
        ('Sepal length', 'mean'), 
        ('Sepal length', 'median'),
        ('Sepal width', 'count'), 
        ('Sepal width', 'mean'), 
        ('Sepal width', 'median')
    ])
    
    return stats.round(4)


# The Test Program includes automatic checking of the answer. Don't Edit it!
url_src = "data/iris_1.csv"
res = calculate_stats_for_groups(url_src)
print(res)

try:
    print(res.to_string())
    pd.testing.assert_frame_equal(res, correct_03_03, check_dtype=True)
    print(f'Result was OK')
except AssertionError as err_msg:
    print(err_msg)

                Sepal length                Sepal width               
                       count    mean median       count    mean median
Species                                                               
Iris-setosa               50  5.0060    5.0          50  3.4180    3.4
Iris-versicolor           50  5.9360    5.9          50  2.7700    2.8
Iris-virginica            43  6.6186    6.5          43  2.9535    3.0
                Sepal length                Sepal width               
                       count    mean median       count    mean median
Species                                                               
Iris-setosa               50  5.0060    5.0          50  3.4180    3.4
Iris-versicolor           50  5.9360    5.9          50  2.7700    2.8
Iris-virginica            43  6.6186    6.5          43  2.9535    3.0
Result was OK


## Assignment 03-04. Grouping, filtering and reading text file (1p)

The primary goal of this assignment is to read data from a text file, process the data, and perform specific operations on the `DataFrame`.
The task involves manipulating the data and presenting the results according to the specified formatting requirements.

In this assignment, you will read data from a text file (it is not directly in CSV format) into a `DataFrame`.
The implementation goes in the `emissions_per_sector()` function.
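
One way to read a tab-separated text file with stray whitespace around the tabs is a regular-expression separator with the Python parser engine. The sketch below uses an in-memory string with hypothetical contents, since the real layout of the emissions file is not shown here. With a regex separator the quote characters may be left in the parsed values, so the sketch strips them explicitly.

```python
import io

import pandas as pd

# Hypothetical file contents: quoted fields, tabs padded with spaces.
text = (
    '"country_code" \t "value" \t "year"\n'
    '"FI" \t 10 \t "2010"\n'
    '"SE" \t 20 \t "2011"\n'
)

# A raw string keeps the backslashes intact; regex separators need engine="python".
df = pd.read_csv(io.StringIO(text), sep=r"\s*\t\s*", engine="python")

# Remove any quote characters that survived parsing in the header and values.
df.columns = df.columns.str.replace('"', "", regex=False)
df["country_code"] = df["country_code"].astype(str).str.replace('"', "", regex=False)
df["year"] = pd.to_numeric(
    df["year"].astype(str).str.replace('"', "", regex=False), errors="coerce"
)
```

Writing the pattern as a raw string (`r"..."`) avoids Python treating `\s` as an invalid string escape.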

* Read the file whose name is given in the test program variable `url_src`.
* Keep only the columns `main activity sector name`, `value` and `year` in the `DataFrame`.
* Rename the column `main activity sector name` to `sector`.
* Remove from the `DataFrame` the rows where the strings `20-99 All stationary installations` or `21-99 All industrial installations (excl. combustion)` appear in any column.
* Save in a new `DataFrame` all rows where the `year` column is *>= 2010* and *<= 2015*.
* Calculate the total emissions by sector in the new `DataFrame`. The sum is calculated from the `value` column, grouped by the sector name.
* Sort the rows of the `DataFrame` in descending order according to the column `value`.
* Round the resulting `float` values to _two (2) decimal_ places and display the float results in a _20-column wide_ field and in _non-scientific notation_.
* Return the first six (6) rows from the `DataFrame`.
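
The grouping, sorting and formatting steps above can be sketched on toy data. This is an illustration under assumptions, not the expected solution: the sector labels and values are hypothetical, and passing `float_format` to `to_string` is just one way to apply the format (setting `pd.options.display.float_format` is another).

```python
import pandas as pd

# Hypothetical toy data standing in for the emissions rows.
toy = pd.DataFrame({
    "sector": ["A", "B", "A", "C", "B"],
    "value": [1500000.5, 20.0, 2500000.25, 300.0, 5.0],
})

# Sum per sector, then sort with the largest total first.
totals = (
    toy.groupby("sector", as_index=False)["value"].sum()
       .sort_values(by="value", ascending=False)
)

# 20-character field, thousands separators, two decimals, no scientific notation.
fmt = "{:20,.2f}".format
formatted = [fmt(v) for v in totals["value"]]

# Apply the same formatter when rendering the first rows as text.
text = totals.head(2).to_string(index=False, float_format=fmt)
```

Formatting at display time keeps the `value` column numeric, so it can still be sorted or summed afterwards.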

In [None]:
import pandas as pd

correct_03_04 = """               value                              sector
    16,744,275,369.00              20 Combustion of fuels
    2,135,161,344.00 24  Production of pig iron or steel
    1,859,208,638.00     29 Production of cement clinker
    1,714,290,908.00         21  Refining of mineral oil
      669,997,806.00                         10 Aviation
      554,345,679.00     42 Production of bulk chemicals"""


def emissions_per_sector(url):
    # Reading the file with flexible delimiter handling and cleaning up column names
    df = pd.read_csv(url, delimiter=r'\s*\t\s*', engine='python')  # raw string avoids invalid-escape warnings
    df.columns = df.columns.str.replace('"', '')
    
    # Stripping quotation marks from 'year' values and excluding non-specific year values
    df['year'] = df['year'].str.replace('"', '')
    df = df[~df['year'].str.contains('Total|None', na=True)].copy()  # copy avoids SettingWithCopyWarning on later assignments
    
    # Convert 'year' column to integers
    df['year'] = pd.to_numeric(df['year'], errors='coerce')
    
    # Saving only the specified columns and rename as needed
    df = df[['main activity sector name', 'value', 'year']].rename(columns={'main activity sector name': 'sector'})
    
    # Converting 'value' column to numeric, ensuring non-convertible values are handled
    df['value'] = pd.to_numeric(df['value'], errors='coerce')
    
    # Removing specified rows
    df = df[~df['sector'].isin(['20-99 All stationary installations', '21-99 All industrial installations (excl. combustion)'])]
    
    # Filtering rows by year
    df_filtered = df[(df['year'] >= 2010) & (df['year'] <= 2015)]
    
    # Calculating total emissions by sector
    total_emissions = df_filtered.groupby('sector')['value'].sum().reset_index()
    
    # Sorting in descending order by value
    sorted_emissions = total_emissions.sort_values(by='value', ascending=False)
    
    # Preparing the DataFrame to return
    sorted_emissions['value'] = sorted_emissions['value'].apply(lambda x: f"{x:,.2f}")
    result_df = sorted_emissions[['value', 'sector']].head(6)
    
    return result_df

# The Test Program includes automatic checking of the answer. Don't Edit it!
url_src = 'data/emissions.csv'
res = emissions_per_sector(url_src)

try:
    print(res.to_string(index=False))
    assert res.to_string(index=False) == correct_03_04, "Error in result"
    print(f'Result was OK')
except AssertionError as err_msg:
    print(err_msg)

### Comment on Assignment 03-04
- The computed values match the expected output. The remaining difference is a formatting issue that took a long time to try to solve, so I left it unresolved.

## Assignment 03-05. Grouping, calculating, and time-based analysis. (1p)

The primary goal of this assignment is to read data from a text file, process the data, and perform specific operations on the `DataFrame`.
The task involves reading emission data from a text file, implementing time-based analysis, and calculating various metrics related to emissions over the years.

In this assignment, you will read data from a text file (note that it is not directly in CSV format) into a `DataFrame`.
The implementation goes in the `emissions_per_year()` function.

* Read the file whose name is given in the test program variable `url_src`.
* Keep the columns `country_code`, `main activity sector name`, `value` and `year` in the `DataFrame`.
* Rename the column `main activity sector name` to `sector`.
* Remove from the `DataFrame` the rows where the strings `20-99 All stationary installations` or `21-99 All industrial installations (excl. combustion)` appear in any column.
* Save in a new `DataFrame` all rows where the `year` column is >= 2010 and <= 2018.
* Calculate in the new `DataFrame` the total emissions for each year (sum the values of the column `value`, grouped by the column `year`).
* In the new column `change in percent`, calculate the percentage change in emissions from the previous year. Round the percentage changes to one decimal place.
* Add a new column `cumulative sum` to the `DataFrame` that holds the running total of emissions from 2010 to 2018. Note! The year _2009_ is also included in the cumulative sum, but its row is dropped and not shown in the final results.
* Set the `DataFrame` index to the column `year`.
* Return all rows in the `DataFrame`.
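
The year-over-year change and the running total can be sketched with `pct_change()` and `cumsum()` on a toy yearly series. The values below are hypothetical. The extra leading year plays the role of 2009: it feeds both calculations and is dropped afterwards.

```python
import pandas as pd

# Hypothetical yearly totals; 2009 is kept only as the baseline year.
yearly = pd.Series(
    [100.0, 120.0, 90.0],
    index=pd.Index([2009, 2010, 2011], name="year"),
)

result = pd.DataFrame({
    "emissions": yearly,
    # Relative change against the previous row, as a percentage.
    "change in percent": yearly.pct_change().mul(100).round(1),
    # Running total, starting from the baseline year.
    "cumulative sum": yearly.cumsum(),
}).drop(index=2009)
```

Without the baseline row, the first reported year would have `NaN` as its percentage change, and the cumulative sum would start from that year's own value.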

In [2]:
import pandas as pd

correct_03_05 = pd.DataFrame({'emissions': {2010: 4728330103.0, 2011: 6207011700.0, 2012: 6501090085.0,
                                            2013: 3160894807.0, 2014: 2897831041.0, 2015: 2254673985.0,
                                            2016: 2815203698.0, 2017: 2478217980.0, 2018: 2685203623.0},
                              'change in percent': {2010: -17.1, 2011: 31.3, 2012: 4.7, 2013: -51.4, 2014: -8.3,
                                                    2015: -22.2, 2016: 24.9, 2017: -12.0, 2018: 8.4},
                              'cumulative sum': {2010: 10435319082.0, 2011: 16642330782.0, 2012: 23143420867.0,
                                                 2013: 26304315674.0, 2014: 29202146715.0, 2015: 31456820700.0,
                                                 2016: 34272024398.0, 2017: 36750242378.0, 2018: 39435446001.0}})
correct_03_05.index = correct_03_05.index.astype('int32')
correct_03_05.index.name = "year"


def emissions_per_year(url):
    # Reading the file with flexible delimiter
    df = pd.read_csv(url, delimiter=r'\s*\t\s*', engine='python')  # raw string avoids invalid-escape warnings
    df.columns = df.columns.str.replace('"', '')
    
    # Retaining only the specified columns and renaming as needed
    df = df[['country_code', 'main activity sector name', 'value', 'year']].rename(columns={'main activity sector name': 'sector'})
    
    # Removing specified rows
    df = df[~df['sector'].isin(['20-99 All stationary installations', '21-99 All industrial installations (excl. combustion)'])]
    
    # Converting 'year' column to integers and 'value' column to numeric
    df['year'] = pd.to_numeric(df['year'].str.replace('"', ''), errors='coerce')
    df['value'] = pd.to_numeric(df['value'], errors='coerce')
    
    # Filtering rows by year, including 2009 for the cumulative sum calculation
    df_filtered = df[(df['year'] >= 2009) & (df['year'] <= 2018)]
    
    # Calculating total emissions per year
    total_emissions_per_year = df_filtered.groupby('year')['value'].sum()
    
    # Calculating percentage change from the previous year
    change_in_percent = total_emissions_per_year.pct_change().mul(100).round(1)
    
    # Calculating cumulative sum
    cumulative_sum = total_emissions_per_year.cumsum()
    
    # Preparing the final DataFrame, excluding the year 2009 from the results as specified
    result = pd.DataFrame({
        'emissions': total_emissions_per_year,
        'change in percent': change_in_percent,
        'cumulative sum': cumulative_sum
    }).drop(index=2009.0)
    
    # Setting the DataFrame index to column year and converting it to integer
    result.index = result.index.astype('int32')
    result.index.name = 'year'
    
    return result

# The Test Program includes automatic checking of the answer. Don't Edit it!
url_src = 'data/emissions.csv'
res = emissions_per_year(url_src)

try:
    print(res.to_string())
    pd.testing.assert_frame_equal(res, correct_03_05, check_dtype=True)
    print(f'Result was OK')
except AssertionError as err_msg:
    print(err_msg)

         emissions  change in percent  cumulative sum
year                                                 
2010  4.728330e+09              -17.1    1.043532e+10
2011  6.207012e+09               31.3    1.664233e+10
2012  6.501090e+09                4.7    2.314342e+10
2013  3.160895e+09              -51.4    2.630432e+10
2014  2.897831e+09               -8.3    2.920215e+10
2015  2.254674e+09              -22.2    3.145682e+10
2016  2.815204e+09               24.9    3.427202e+10
2017  2.478218e+09              -12.0    3.675024e+10
2018  2.685204e+09                8.4    3.943545e+10
Result was OK
