*General hints:* <br>
* You may use another notebook to test different approaches and ideas. When complete and mature, turn your code snippets into the requested functions in this notebook for submission. 
* Make sure the function implementations are generic and can be applied to any dataset (not just the one provided).
* Add explanatory code comments in the code cells. Make sure that these comments improve our understanding of your implementation decisions.

-----
* Create a variable holding your student id, as shown below. 
* Simply replace the example (`01234567`) with your actual student id having a total of 8 digits. 
* Maintain the variable as a string, do NOT change its type in this notebook!
* *Note: If your student id has 7 digits, add a leading 0. The final student id MUST have 8 digits!*
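
A minimal sketch of the padding rule above, assuming a hypothetical 7-digit id (`raw_id` and `student_id` are illustrative names only, not part of the assignment):

```python
# Hypothetical 7-digit id, used for illustration only
raw_id = "1234567"

# str.zfill pads with leading zeros up to the requested width and keeps the value a string
student_id = raw_id.zfill(8)
```

An id that already has 8 digits is left unchanged by `zfill(8)`.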

In [2]:
import pandas as pd

mn = '12209427'

## 0. Import

Implement a function `tidy` which imports the data set assigned and provided to you as a CSV file into a `pandas` dataframe. Assess the data set and establish whether it is tidy. If not, clean the data set before continuing with Step 1. Mind all rules of tidying data sets in this step. Make sure you comply with the following statements:
* If there is an index column (row numbers) in your tidied dataset, keep it.
* The following columns, once identified, correspond to variables 1:1 (no need for transformations):
  * `full_name`
  * `automotive`
  * `color`
  * `job`
  * `address`
  * `coordinates`
* The tidied dataset should have a total of 8 columns (not including the index), the first column should be `full_name`.
* Mind the intended content of each attribute (e.g. full_name should contain the full name of a person, no need to change that)
* If tidy or done, have the function `tidy` return the ready data set as a dataframe.

Note that `tidy` must take a single parameter that holds the basename of the CSV file (i.e., the name without file extension). Do NOT change the name of the file, do not overwrite the original data file, and make sure you submit your final ZIP following the [Code of Conduct](https://datascience.ai.wu.ac.at/ws21/dataprocessing1/code_of_conduct.html) requirements. Especially, make sure you put your data file in a folder called `data/` when submitting.

In [31]:

mn = '12209427'

# Function to split merged values into two separate parts
def split_digit_text(value):
    # Walk the string backwards and split after the last digit, which matches the "<number><text>" structure of the merged values
    for i in range(len(value), 0, -1):  # stop at 0 so the first character is checked too
        if value[i-1].isdigit():
            return value[:i], value[i:]  # Return the split parts as a tuple
    # No digit found: strip missing-value tokens and return the remainder
    for element in ["None", "NaN", "NA", "-inf"]:
        if element in value:
            value = value.replace(element, "")
    return value, None  # Return the whole string and None if no splitting point is found

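def split_digit_text2(value):
    # Strip missing-value tokens first; whatever remains is either the timestamp or the text part
    for element in ["None", "NaN", "NA", "-inf"]:
        if element in value:
            value = value.replace(element, "")
            if not value:
                return None, None  # Nothing left: both parts are missing
            if value[0].isalpha():
                return None, value
            if value[0].isdigit():
                return value, None
    if not value:
        return None, None  # Guard against empty cells before indexing value[0]
    # A full timestamp such as "1994-06-23 00:06:06.885525" is 26 characters long
    if len(value) > 26:
        return value[:26], value[26:]
    if value[0].isalpha():
        return None, value
    return value, None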

print(split_digit_text2("1994-06-23 00:06:06.885525CalvinHarris"))

def tidy(m_n):
    
    file_path = f"data/{m_n}.csv"
    df = pd.read_csv(file_path)
    
    column_names = ["full_name", "automotive", "color", "job", "address", "coordinates"]
    # Save temporary first_column values of the csv file
    first_col = df[df.columns[0]]

    
    # Check if the column names given are in the first column using not .isdisjoint(). True if value is in both sets (-> wrong orientation) and then transpose df
    if not set(column_names).isdisjoint(first_col):
        # transpose df
        df = df.transpose()
        # assign the column values to the values found in the first column before transposing 
        df.columns = first_col
        # remove the unnecessary first row (same as the headers)
        df = df.drop(df.index[0])

    # Check whether the dataset has fewer than the required 8 columns
    if len(df.columns) < 8:
        # Find the faulty (merged) column
        special_col = set(df.columns).difference(set(column_names))
        # If there is only one special column, extract its name as a string
        if len(special_col) == 1:
            special_col = special_col.pop()
        # Separate the merged column name at its delimiter
        two_col = special_col.split("/")
        # Create two new df columns from the separated names and apply split_digit_text2 to split every value accordingly
        df[two_col[0]], df[two_col[1]] = zip(*df[special_col].apply(split_digit_text2))
        # Drop the faulty column
        df.drop(columns=special_col, inplace=True)
        
        
    # Return the tidied dataframe, as required by the task and the tests
    return df
    
tidy(mn)

('1994-06-23 00:06:06.885525', 'CalvinHarris')


In [14]:
from nose.tools import assert_equal
import pandas
assert_equal(type(tidy(mn)), pandas.core.frame.DataFrame)
assert_equal(len((tidy(mn)).columns), 8)
assert_equal(list((tidy(mn)).columns)[0], "full_name")


FileNotFoundError: [Errno 2] No such file or directory: '12209427'

-------
## 1. Missing values

### 1.1 Code part
Implement a function called `missing_values` which takes a dataframe as input and checks whether there are any missing values in the dataset. Record the row ids of the observations containing missing values as a list of numbers and make sure that the function returns the recorded list in the end. If there are no missing values, `missing_values` should return an empty list.

NOTE: Try to find out by manual inspection how missing values are encoded in your dataset and which missing values occur in it, but at least test for the following: `"nan"`, `"NA"`, `"-inf"`, `"inf"`, `"None"`; also treat fields containing the numeric value 0, empty fields, and fields containing only whitespace as missing. (We are aware that this generic test might be overshooting in practice ;-))
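
The check described above can be sketched as follows. This is one possible reading of the rules, not a reference solution: `is_missing` and `missing_values_sketch` are hypothetical names, only numeric cells (not the text `"0"`) are compared against 0, and the row ids are assumed to be convertible to `int`.

```python
import numpy as np
import pandas as pd

# Tokens the task asks to test for; extend after inspecting your own data
MISSING_TOKENS = {"nan", "NA", "-inf", "inf", "None"}

def is_missing(value):
    """Return True if a single cell should count as missing."""
    # Real NaN / None values
    if pd.isna(value):
        return True
    # Numeric cells: zero and infinities count as missing in this exercise
    is_number = isinstance(value, (int, float, np.integer, np.floating))
    if is_number and not isinstance(value, bool):
        return bool(value == 0 or np.isinf(value))
    # Text cells: empty, whitespace-only, or one of the tokens
    text = str(value).strip()
    return text == "" or text in MISSING_TOKENS

def missing_values_sketch(x):
    """Return the ids (as plain ints) of rows with at least one missing cell."""
    # Apply the cell-level check to every column, then flag rows with any hit
    mask = x.apply(lambda col: col.map(is_missing)).any(axis=1)
    return [int(i) for i in x.index[mask.to_numpy()]]
```

Keeping the cell-level test in its own function makes it easy to add further encodings found during manual inspection.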

In [None]:
def missing_values(x):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
from nose.tools import assert_equal
assert_equal(type(missing_values(tidy(mn))), list)
assert_equal(all(isinstance(i, int) for i in missing_values(tidy(mn))), True)


### 1.2. Analytical part

* Does the dataset contain missing values?
* If no, explain how you proved that this is actually the case.
* If yes, describe the discovered missing values. What could be an explanation for their missingness?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!


YOUR ANSWER HERE

------
## 2. Handling missing values
### 2.1. Code part
Implement a (simple) function called *handling_missing_values* for handling missing values using an adequate single-imputation technique of your choice per type of missing values. Make use of the techniques learned in Unit 4. The function should take a dataframe as input and return the updated dataframe. Mind the following:
- The objective is to apply single imputation on these synthetic data. Do not make up a background story (at this point)!
- Do NOT simply drop the missing values. This is not an option.
- The imputation technique must be adequate for a given variable type (quantitative, qualitative). 
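
One way to match the imputation technique to the variable type is median imputation for quantitative columns and mode imputation for qualitative ones. The sketch below is only an illustration under assumptions: `impute_sketch` and its `numeric_cols` parameter are hypothetical, and which columns are quantitative has to be decided by inspecting the data.

```python
import numpy as np
import pandas as pd

# Text encodings treated as missing, including the empty string
BLANK_OR_TOKENS = {"", "nan", "NA", "-inf", "inf", "None"}

def impute_sketch(x, numeric_cols):
    """Single imputation: median for quantitative, mode for qualitative columns."""
    out = x.copy()
    for col in out.columns:
        if col in numeric_cols:
            # Coerce to numbers; tokens and blanks become NaN
            values = pd.to_numeric(out[col], errors="coerce")
            # Zero and infinities count as missing in this exercise
            values = values.mask((values == 0) | np.isinf(values))
            out[col] = values.fillna(values.median())
        else:
            # Normalise text cells and turn missing encodings into NaN
            text = out[col].map(
                lambda v: np.nan
                if pd.isna(v) or str(v).strip() in BLANK_OR_TOKENS
                else str(v).strip()
            )
            mode = text.mode()
            # Fill with the most frequent value, if the column has any value at all
            out[col] = text.fillna(mode.iloc[0]) if not mode.empty else text
    return out
```

Both techniques keep the shape of the dataframe unchanged, but they shrink the variance of the imputed variable, which is worth discussing in the analytical part.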

In [None]:
def handling_missing_values(x):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
from nose.tools import assert_equal
assert_equal(len(missing_values(handling_missing_values(tidy(mn)))), 0)
assert_equal(handling_missing_values(tidy(mn)).shape, tidy(mn).shape)

### 2.2. Analytical part
Discuss the implications. Answer the following:

- How would you qualify the data-generating processes leading to different types of missing values, provided that the data was not synthetic?
- What are the benefits and disadvantages of the chosen single-imputation technique?
- How would you apply a multiple-imputation technique to one type of missing values, if applicable at all?
- We asked you to test for/treat as missing values by checking certain field values, as well as empty fields or fields containing the numeric value 0... what are potential problems of these heuristics?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

YOUR ANSWER HERE

-----
## 3. Duplicate entries
Implement a function called `duplicates` that takes as input a (tidy) dataframe `x`. Assume that `duplicates` receives a dataframe as returned from your Step 0 implementation of `tidy`. It then checks whether there are any duplicates in the dataset. Record the row ids of the observations that are duplicates and have `duplicates` return the list in the end. An empty list indicates the absence of duplicated observations.
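
A minimal sketch of this check, assuming the simplest duplicate definition (rows that are identical in every column) and integer-like row ids; `duplicates_sketch` is a hypothetical name:

```python
import pandas as pd

def duplicates_sketch(x):
    """Return the ids of rows that exactly repeat an earlier row."""
    # keep="first": the first occurrence counts as the original,
    # every later identical row is flagged as a duplicate
    mask = x.duplicated(keep="first")
    return [int(i) for i in x.index[mask.to_numpy()]]
```

`DataFrame.duplicated` also accepts a `subset` argument, which would allow a narrower definition such as "same `full_name` and `address`".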

In [None]:
def duplicates(x):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
from nose.tools import assert_equal
assert_equal(type(duplicates(tidy(mn))), list)


-----
## 4. Handling duplicate entries
### 4.1. Code part
Implement a function called `handling_duplicate_entries` for handling duplicate entries. Again, the function is assumed to receive a tidied data set as obtained from Step 0. It deduplicates the tidy data set. The function then returns the dataframe without duplicates.
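
Under the same exact-match definition of duplicates, deduplication can be sketched with `DataFrame.drop_duplicates`. The name `dedup_sketch` is hypothetical, and keeping the first occurrence is an assumption rather than a requirement of the task:

```python
import pandas as pd

def dedup_sketch(x):
    """Return a copy of x without rows that exactly repeat an earlier row."""
    # keep="first" retains the first occurrence and the original row ids
    return x.drop_duplicates(keep="first")
```

Because the original index is preserved, the surviving rows can still be traced back to the input data.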

In [None]:
def handling_duplicate_entries(x):
    # YOUR CODE HERE
    raise NotImplementedError()


In [None]:
from nose.tools import assert_equal
assert_equal(len(duplicates(handling_duplicate_entries(tidy(mn)))), 0)

### 4.2. Analytical part
Discuss the implications. 

- What are the benefits and disadvantages of the chosen duplicate definition and the chosen duplicate-handling technique?
- Name and explain one alternative definition of (intra-source) duplicates for the given dataset!

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

YOUR ANSWER HERE