# Data Wrangling


**1. Describe the data: what values it mostly contains, what is the size of the dataset etc.**  
Prior to processing there are 100,200 entries in the dataset and more columns/features than are relevant. Most of the values appear to be ones and zeroes. In a real-world scenario I might consider whether the bulk of the data is boolean, but in this instance it seems to just be noise, because giving a candidate a table with only a few columns would be too easy. Based on the task, the dataset is clearly intended to simulate the results of a survey.

    - Before processing: 100,200 entries.
    - Mostly 1's and 0's, but there are other values in there (spotted a few 2's, and q2 and q4 clearly accept 2's, so I doubt it's boolean data or normalised data).
    - Real survey data would probably show more diversity in the values than 1's and 0's.
    - A person's response? Multiple choice? Flags for whether the user did something or didn't?
    
**2. Drop all duplicate respondents based on `respondent_index` column (keep the first occurrence of the respondent)**  
Easily doable with a dataframe. The call `dataset.groupby(dataset.index).nth(0)` deals with this. I originally tried `.first()`, but when I ran it on a smaller set of dummy data I found it was skipping NaN values: it returns the first non-null value in each column rather than the first row. Since I'm being asked to take the first occurrence, and not yet being expected to remove NaN values, `.nth(0)` works.
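
The difference between the two calls shows up on a small dummy frame. This is a minimal sketch rather than the actual dataset: the values and the `df` name are made up, and it assumes `respondent_index` is the dataframe's index. `index.duplicated` is included as an alternative way of keeping the first row per respondent.

```python
import numpy as np
import pandas as pd

# Dummy data (hypothetical values): respondent 10 appears twice,
# and the first occurrence has a NaN in q2.
df = pd.DataFrame(
    {"q2": [np.nan, 1.0, 0.0], "q4": [2.0, 2.0, 1.0]},
    index=pd.Index([10, 10, 11], name="respondent_index"),
)

# Keeps the first *row* per respondent, NaNs included.
nth_rows = df.groupby(df.index).nth(0)

# Same result via a boolean mask on the index.
first_rows = df[~df.index.duplicated(keep="first")]

# Takes the first *non-null value* per column, so the NaN in q2
# is silently replaced by the value from the later duplicate.
first_non_null = df.groupby(df.index).first()

print(first_rows)
print(first_non_null)
```

For respondent 10, `nth_rows` and `first_rows` keep the NaN in `q2`, while `first_non_null` reports the value from the second row.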

**3. We will compute weighting based on the `q2` (gender) and `q4` (age group) column. Drop all respondents having NaNs in these columns. Do you know why this could happen and what the possible solutions are?**  
Dropping NaN values from specified columns is pretty straightforward: `dataset.dropna(subset=['q2', 'q4'])`. 

The NaN values would originate from users who have not specified their age or gender, presumably by choice. In other instances they might stem from the user never being presented with that particular question. A common example would be "What country are you from?" If the user selects "US" there might be a follow-up field asking which state, but that field doesn't appear for anyone outside the US.

Typically you would either drop the rows with NaN values, or fill them in with a typical value for that column/feature (the mean for numeric features, the most common value for categorical ones like these).
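
Both options can be sketched on a toy frame. The data and the extra `q5` column are invented for illustration, and filling with the mode is just one possible imputation choice, not something the task prescribes.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): NaNs in the two weighting columns.
df = pd.DataFrame(
    {
        "q2": [0.0, np.nan, 1.0, 1.0],
        "q4": [2.0, 1.0, np.nan, 1.0],
        "q5": [1, 0, 1, 1],
    }
)

# Option 1: drop respondents missing either weighting column.
dropped = df.dropna(subset=["q2", "q4"])

# Option 2: fill each column with its most common value (the mode),
# since q2 and q4 are category codes rather than measurements.
fill_values = {col: df[col].mode()[0] for col in ("q2", "q4")}
filled = df.fillna(fill_values)

print(dropped)
print(filled)
```

Dropping keeps the weighting honest at the cost of sample size; imputing keeps every row but assigns some respondents to a cell they never chose.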

**4. There is a `q3_column.csv` file in the directory. Download it and merge it on the index to the original dataframe. Are there more variants of (a database) merge?**  
    - I've loaded the table as a dataframe, and the q3 file as a series. I've then just concatenated them.
    - Uuh, I don't know?
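
For reference on the second half of the question: pandas exposes the usual database join types (inner, left, right and outer) through the `how` argument of `merge`. The sketch below uses made-up frames, not the real `q3_column.csv`, and only shows how the row count changes with each variant.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the main table and the q3 column.
left = pd.DataFrame(
    {"q2": [0, 1, 1]},
    index=pd.Index([1, 2, 3], name="respondent_index"),
)
q3 = pd.Series(
    [5, 6, 7],
    index=pd.Index([2, 3, 4], name="respondent_index"),
    name="q3",
)

merged = {
    how: pd.merge(left, q3.to_frame(), left_index=True, right_index=True, how=how)
    for how in ("inner", "left", "right", "outer")
}

for how, frame in merged.items():
    # inner: only shared respondents; left/right: all rows of one side;
    # outer: the union of both indexes, with NaN where a side is missing.
    print(how, len(frame))
```

Plain concatenation along the columns behaves like an outer join on the index, so it only matches a left merge when both inputs cover the same respondents.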

# Computation

    - Well `factor = 1` just smells of mistake.
    - Even if it's not, count and quote_size are still unused, and you want to scale it by target over true value, which is quote_size/count.
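
The idea of scaling by target over actual count can be checked on a toy example. This is a sketch under assumptions: it does not reproduce `compute_weights.py`, and `respondents`, `quotas` and the numbers in them are hypothetical stand-ins for the script's data.

```python
import pandas as pd

# Toy respondents (hypothetical): one row per person with their q2/q4 codes.
respondents = pd.DataFrame({"q2": [0, 0, 0, 1, 1], "q4": [0, 0, 1, 0, 0]})

# Target size for each (q2, q4) cell (hypothetical quotas).
quotas = {(0, 0): 1000, (0, 1): 2000, (1, 0): 5000}

# Actual number of respondents in each cell.
counts = respondents.groupby(["q2", "q4"]).size()

# factor = target / actual count, instead of a constant 1.
factors = {key: quotas[key] / count for key, count in counts.items()}

respondents["weight"] = [
    factors[(q2, q4)] for q2, q4 in zip(respondents["q2"], respondents["q4"])
]

# Summing the weights per cell should now reproduce the quotas.
weighted_totals = respondents.groupby(["q2", "q4"])["weight"].sum()
print(weighted_totals)
```

With a constant factor of 1 the weighted total of each cell is just its raw count, which is why the two columns of the output disagree.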

The numbers on each line clearly don't match, so you need to edit the code so that they do. There is a bug in computing the `factor` in the function `get_factor_weights`; `factor` is the value that should be assigned to a respondent with a given combination of q2 and q4.

TASK: You have to locate and fix this bug in order to complete this part. The numbers on each line should match. Hence, the output of a correct solution should be:

    $ python compute_weights.py tmp.hdf
    (q2, q4): (0, 0)	 1000.0 	 1000
    (q2, q4): (0, 1)	 2000.0 	 2000
    (q2, q4): (0, 2)	 3000.0 	 3000
    (q2, q4): (1, 0)	 5000.0 	 5000
    (q2, q4): (1, 1)	 8000.0 	 8000
    (q2, q4): (1, 2)	 3000.0 	 3000

VS

    $ python compute_weights.py tmp.hdf
    (q2, q4): (0, 0)	 1643.0 	 1000
    (q2, q4): (0, 1)	 21698.0 	2000
    (q2, q4): (0, 2)	 26548.0 	3000
    (q2, q4): (1, 0)	 1743.0 	 5000
    (q2, q4): (1, 1)	 21658.0 	8000
    (q2, q4): (1, 2)	 26809.0 	3000


# Results

Create a GitHub repository, push your solution and your report to this repository (not the data). The report should give us a good picture of how you approached and solved the problem. Use whatever format you like, but Jupyter notebook is preferred.


# Bonus Task 

On a VPS provider of your choice (there are e.g. free trials on Digital Ocean, Google Compute, AWS, etc.), create a machine with any Linux distribution. Install an Elasticsearch instance, upload the final dataframe to the server and then write a Python script which reads the file line by line and writes data into the database.

*Note:* What's the purpose here? To demonstrate ability with cloud servers, or to demonstrate Linux/push capability?
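
The "read the file line by line" part of the script can be sketched without a running server. This is an assumption-heavy outline, not a tested deployment: it assumes the final dataframe was exported as CSV, and the index name `survey_responses` and the column names are hypothetical. The dictionaries use the `_index` / `_id` / `_source` action layout that bulk helpers such as `elasticsearch.helpers.bulk` in the official Python client accept.

```python
import csv
import io


def iter_bulk_actions(lines, index_name="survey_responses", id_field="respondent_index"):
    """Yield one bulk-index action per CSV row, reading lazily line by line."""
    reader = csv.DictReader(lines)
    for row in reader:
        yield {
            "_index": index_name,   # hypothetical index name
            "_id": row[id_field],   # reuse the respondent id as the document id
            "_source": dict(row),   # all values are strings at this point
        }


# In-memory stand-in for the exported file (hypothetical columns and values).
sample = io.StringIO("respondent_index,q2,q4,weight\n1,0,2,1.5\n2,1,0,0.8\n")
actions = list(iter_bulk_actions(sample))
print(actions[0])
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, and the generator would be handed to the client's bulk helper instead of being collected into a list.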