# CAR COMPANY (20 mins)
# Instructions / Notes: Read these carefully
This **Python Jupyter Notebook** is split into the following sections:

1. **Initial section** with pre-filled cells that you should run to load some Python modules (packages), the datasets required for your task, and their variables into memory.
2. **Middle section** with **description of a concrete task** associated with the dataset, and some **ideas** related to the task that you may choose to work with.
3. **Final section (with one or more empty cells)** where you can perform analyses with the loaded dataset (e.g., write a few lines of code if needed), answer the question posed, and describe your reasoning in words. 

**Read and execute each cell in order, without skipping forward**. To execute any cell, press **Shift+Enter** on your keyboard. It might take a couple of seconds to receive an output. 

Have fun!

In [None]:
#Run the following to import necessary packages and import dataset. 
import pandas as pd
import numpy as np
import scipy as sp
from pandas.plotting import parallel_coordinates
from othersAQ_FD import different_idea
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

d1 = "AQ_phase1_dataset1.csv"
d2 = "AQ_phase1_dataset2.csv"
d3 = "AQ_phase1_dataset3.csv"
d4 = "AQ_phase1_dataset4.csv"

df1 = pd.read_csv(d1)
df2 = pd.read_csv(d2)
df3 = pd.read_csv(d3)
df4 = pd.read_csv(d4)

df1_copy=df1.copy()
df2_copy=df2.copy()
df3_copy=df3.copy()
df4_copy=df4.copy()
df1_copy['Input_Dataset_name'] = 'Dataset 1 (Company A)'
df2_copy['Input_Dataset_name'] = 'Dataset 2 (Company B)'
df3_copy['Input_Dataset_name'] = 'Dataset 3 (Company C)'
df4_copy['Input_Dataset_name'] = 'Dataset 4 (Company D)'

#Print first five lines of dataset 1 as a check to see if the datasets are loaded properly.
df1.head(n=5)

# DATASET DESCRIPTION
Each of the 4 dataframes loaded above records the **total number of units sold** (in hundreds) and **employee satisfaction** (on a scale of 1 to 100) from 182 sites all over the world, with one dataframe each for car companies 1, 2, 3 and 4.

Run the cells below to obtain some descriptive (numerical) statistics and a parallel coordinates visualization for these datasets. 

1. **Median** is a measure of central tendency that separates the higher half from the lower half of a data sample.

2. **Interquartile range (IQR)** is a measure of variability (statistical dispersion), based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The IQR is the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles.

3. **Spearman's correlation** measures the strength and direction of monotonic association between two variables. A monotonic relationship is a relationship that does one of the following: (1) as the value of one variable increases, so does the value of the other variable; or (2) as the value of one variable increases, the other variable value decreases. 

4. **Parallel coordinates** is a plotting technique for multivariate data (allows one to estimate some descriptive statistics visually). Here, data points are represented as connected line segments. Each vertical line represents one data attribute. One complete set of connected line segments across all the attributes represents one data point.
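
As a minimal sketch of the first three definitions, the snippet below computes them on a small made-up dataframe. The column names `units_sold` and `satisfaction` and all of the values are assumptions for illustration only; they are not taken from the loaded datasets.

```python
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
toy = pd.DataFrame({
    "units_sold": [10, 20, 30, 40, 50],
    "satisfaction": [15, 25, 20, 45, 60],
})

# Median: the middle value of each column once it is sorted.
median = toy.median()

# IQR: 75th percentile minus 25th percentile, per column.
iqr = toy.quantile(q=0.75) - toy.quantile(q=0.25)

# Spearman's correlation: correlation computed on the ranks of the values.
spearman = toy.corr(method="spearman").loc["units_sold", "satisfaction"]

print(median)
print(iqr)
print(round(spearman, 2))
```

Here the correlation is high but below 1, because two of the satisfaction values are out of order relative to units sold.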

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#You will receive 4 outputs: median for both variables, inter-quartile range for both variables, Spearman's correlation between the variables, and a parallel coordinates visualization

#CAR COMPANY 1
print ("Median (company 1)")
round(df1.median(),2)
print (" ")
print ("----")

print ("Interquartile range (company 1)")
round((df1.quantile(q=0.75) - df1.quantile(q=0.25)),2)
print (" ")
print ("----")

print ("Spearman correlation (company 1)")
round(df1.corr(method='spearman'),2)
print (" ")
print ("----")

print ("Parallel coordinates visualization (company 1)")
parallel_coordinates(df1_copy, 'Input_Dataset_name')
plt.show()

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#You will receive 4 outputs: median for both variables, inter-quartile range for both variables, Spearman's correlation between the variables, and a parallel coordinates visualization

#CAR COMPANY 2
print ("Median (company 2)")
round(df2.median(),2)
print (" ")
print ("----")

print ("Interquartile range (company 2)")
round((df2.quantile(q=0.75) - df2.quantile(q=0.25)),2)
print (" ")
print ("----")

print ("Spearman correlation (company 2)")
round(df2.corr(method='spearman'),2)
print (" ")
print ("----")

print ("Parallel coordinates visualization (company 2)")
parallel_coordinates(df2_copy, 'Input_Dataset_name')
plt.show()

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#You will receive 4 outputs: median for both variables, inter-quartile range for both variables, Spearman's correlation between the variables, and a parallel coordinates visualization

#CAR COMPANY 3
print ("Median (company 3)")
round(df3.median(),2)
print (" ")
print ("----")

print ("Interquartile range (company 3)")
round((df3.quantile(q=0.75) - df3.quantile(q=0.25)),2)
print (" ")
print ("----")

print ("Spearman correlation (company 3)")
round(df3.corr(method='spearman'),2)
print (" ")
print ("----")

print ("Parallel coordinates visualization (company 3)")
parallel_coordinates(df3_copy, 'Input_Dataset_name')
plt.show()

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#You will receive 4 outputs: median for both variables, inter-quartile range for both variables, Spearman's correlation between the variables, and a parallel coordinates visualization

#CAR COMPANY 4
print ("Median (company 4)")
round(df4.median(),2)
print (" ")
print ("----")

print ("Interquartile range (company 4)")
round((df4.quantile(q=0.75) - df4.quantile(q=0.25)),2)
print (" ")
print ("----")

print ("Spearman correlation (company 4)")
round(df4.corr(method='spearman'),2)
print (" ")
print ("----")

print ("Parallel coordinates visualization (company 4)")
parallel_coordinates(df4_copy, 'Input_Dataset_name')
plt.show()

# TASK
Design **as many measures as you can** to **rank order** the datasets from the **most successful** to the **least successful** car company. Your measures should be based on consideration of every data point in the datasets. We expect you to **generate multiple measures**.

For **each measure that you design**: 

1. Please mark the resulting **dataset ordering** (e.g., 1234, 2134, etc.)
2. Please provide a brief **reasoning** behind your answer (an explanation of **why** you took certain steps or performed certain calculations to get to the solution)
3. Please mark your **confidence** in the designed measure (on a scale of 1 to 5)
4. Please mark how many **ideas you have requested** so far to develop your measure (e.g., 0, 1, 2, etc.)

**MAKE SURE** to fill all four fields for each measure.
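
Once a measure gives one score per company, the ordering string can be produced mechanically. The sketch below assumes a hypothetical dictionary `scores` in which a higher score means a more successful company; the numbers are placeholders, not results from the datasets, and `ordering_from_scores` is an illustrative helper, not part of the notebook.

```python
# Hypothetical per-company scores (higher = more successful); placeholder values.
scores = {1: 0.42, 2: 0.77, 3: 0.15, 4: 0.58}


def ordering_from_scores(scores):
    """Return the company numbers as one string, best score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return "".join(str(company) for company in ranked)


ordering = ordering_from_scores(scores)
print(ordering)  # prints 2413
```

If a lower value of your measure means more success, drop `reverse=True`. Tied scores need an explicit decision that you should state in your reasoning.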


# IDEA:
**Matplotlib has a handy function to generate 1-D histogram: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html**
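
A minimal sketch of that function, using randomly generated placeholder values rather than a column from the datasets. The variable names and the choice of 10 bins are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values standing in for one column of a dataset.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=182)

# plt.hist draws the histogram and returns the bin counts,
# the bin edges, and the drawn bar objects.
counts, bin_edges, _ = plt.hist(values, bins=10)
plt.xlabel("value")
plt.ylabel("number of sites")
plt.show()
```

The returned `counts` and `bin_edges` can be used numerically as well, for example to see how many data points fall in each range.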


# Important note about the idea
You may choose to work (or not work) with this idea in developing your measure. If the idea is not helpful and you are not sure whether or how you might design new measures (or revise measures you have already designed), you can ask for a different idea by typing **different_idea("AQ", "FD1")** in the code cell below the template cell.



In [None]:
#Template for designing a measure. Make copies of this template cell to create as many measures as you are able to (within the allotted time).

#NOTE: Round all your statistics to 2 decimal places before reasoning with them!! 

#REPORT YOUR ANSWER (DATASET ORDERING)
carcompany_ordering_measure = 'None' 
#Choose among: "1234", "1243", "1324", "1342", "1423", "1432", "2134", "2143", "2314", "2341", "2413", "2431", "3124", "3142", "3214", "3241", "3412", "3421", "4123", "4132", "4213", "4231", "4312", "4321"
print(carcompany_ordering_measure)

#REPORT YOUR REASONING
carcompany_ordering_reasoning_measure = 'None'
print(carcompany_ordering_reasoning_measure)

#REPORT CONFIDENCE IN YOUR SOLUTION
confidence_measure = 'None' 
#Choose among: 1 (low confidence), 2, 3 (medium confidence), 4, 5 (high confidence)
print(confidence_measure)

#REPORT A COUNT OF IDEAS REQUESTED SO FAR TO DEVELOP YOUR SOLUTION
ideas_asked_so_far_measure = 'None'
#Choose among: 0 (Did not use the provided idea), 1 (Only used the provided idea), 2 (Asked one additional idea), 3 (Asked two additional ideas)
print(ideas_asked_so_far_measure)


In [None]:
#ONLY use this space below to write your code (if needed) for any measure you generate. DO NOT ERASE this code segment from the workbook.

#IF YOU WANT TO ASK FOR A DIFFERENT IDEA, UNCOMMENT THE LINE BELOW, AND JUST RE-RUN THIS CELL
#different_idea("AQ","FD1")










#Your intuitive ideas are valuable!! If you need syntax-related help in implementing your ideas, you can access the following documentation files (use the "Search" tab for queries) and/or summarized syntax sheets.

#a) Pandas library
#Documentation file: https://pandas.pydata.org/pandas-docs/stable/
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/fbc502d0-46b2-4e1b-b6b0-5402ff273251

#b) Numpy library
#Documentation file: https://docs.scipy.org/doc/numpy/user/index.html
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/e9f83f72-a81b-42c7-af44-4e35b48b20b7

#c) Matplotlib library
#Documentation file: https://matplotlib.org/contents.html
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/28b8210c-60cc-4f13-b0b4-5b4f2ad4790b

#d) Scipy library
#Documentation file: https://docs.scipy.org/doc/scipy/reference/
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/5710caa7-94d4-4248-be94-d23dea9e668f
