#### Exploratory Data Analysis (EDA)

###### Preliminary step in data analysis to:

* Summarize the main characteristics of the data
* Gain a better understanding of the data set
* Uncover relationships between variables
* Extract important variables
    
**Question**

* "What are the characteristics that have the most impact on the car price?"

**Material**

* Descriptive Statistics (describe basic features of a dataset and give a short summary of the sample and measures of the data)
* GroupBy (the basics of grouping data with groupby, and how this can help to transform our dataset)
* ANOVA (the analysis of variance, a statistical method in which the variation in a set of observations is divided into distinct components)
* Correlation (between two different variables)
* Correlation - Statistics (advanced correlation, introducing statistical methods such as Pearson correlation and correlation heatmaps)

## Descriptive Statistics

* Describe basic features of data

When you begin to analyze data, it’s important to first explore your data before you spend
time building complicated models. One easy way to do so is to calculate some
descriptive statistics for your data. 

* Give short summaries about the sample and measures of the data

Descriptive statistical analysis helps to
describe the basic features of a dataset and to obtain a short summary of the sample and measures
of the data.

* Summarize statistics using the pandas **describe()** method

Applied to a dataframe, the describe() method automatically computes basic statistics for
all numerical variables: the number of data points (count), the mean, the standard deviation,
the quartiles, and the extreme values (min and max).

Any **NaN** values are automatically skipped in these statistics. The result gives you a clearer idea
of the distribution of your different variables. Your dataset may also contain categorical variables,
which describe() leaves out by default. Columns that pandas reads as text, for example because
missing entries are marked with `?` as in `normalized-losses`, are likewise not treated as numeric
and do not appear in the summary.
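
As a minimal sketch of the behaviour described above, the following uses a small made-up dataframe (the `toy` name and its values are assumptions for illustration, not part of the automobile dataset). It shows that NaN is skipped in the numeric summary and that calling describe() on a text column reports a different set of statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a missing value
# and one categorical (text) column.
toy = pd.DataFrame({
    "price": [10.0, 20.0, np.nan, 40.0],
    "drive-wheels": ["fwd", "rwd", "fwd", "4wd"],
})

# By default describe() summarizes only the numeric columns.
numeric_summary = toy.describe()
print(numeric_summary)

# NaN is skipped: count is 3 and the mean is taken over 10, 20 and 40.
print(numeric_summary.loc["count", "price"])
print(numeric_summary.loc["mean", "price"])

# On a text column, describe() reports count, unique, top (most frequent
# value) and freq (how often the top value occurs).
categorical_summary = toy["drive-wheels"].describe()
print(categorical_summary)
```

Calling describe() on the single column keeps the sketch simple; for a whole dataframe, the `include` parameter of describe() controls which column types are summarized.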

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
#creating headers
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

In [4]:
#reading the csv file
df = pd.read_csv('Imports_Autos_85.csv', names = headers)

In [5]:
# To see what the data set looks like, we'll use the head() method.
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [7]:
# Using describe() method
df.describe()

Unnamed: 0,symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,highway-mpg
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,10.142537,25.219512,30.75122
std,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,3.97204,6.542142,6.886443
min,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,7.0,13.0,16.0
25%,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,8.6,19.0,25.0
50%,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,9.0,24.0,30.0
75%,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,9.4,30.0,34.0
max,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,23.0,49.0,54.0


#### Value_counts()

* Summarize categorical data with the **value_counts()** method.

For example, in our dataset the drive system is a categorical variable
with three categories: front-wheel drive, rear-wheel drive, and four-wheel drive.
Calling value_counts() on that column returns how many cars fall into each category.
We can also rename the result to make
it easier to read.
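
The idea can be sketched on a small made-up column before applying it to the real data. The values below and the names `shares` and `counts_frame` are assumptions for illustration only. The sketch shows absolute counts, proportions via `normalize=True`, and one way to relabel the result so that it reads well as a table.

```python
import pandas as pd

# Hypothetical toy column standing in for df['drive-wheels'].
drive_wheels = pd.Series(
    ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"], name="drive-wheels"
)

# Absolute counts, sorted from most to least frequent.
counts = drive_wheels.value_counts()
print(counts)

# normalize=True returns proportions instead of counts.
shares = drive_wheels.value_counts(normalize=True)
print(shares)

# Rename the Series, turn it into a one-column DataFrame,
# and label the index with the original column name.
counts_frame = counts.rename("value_counts").to_frame()
counts_frame.index.name = "drive-wheels"
print(counts_frame)
```

Renaming the Series before calling `to_frame()` sets the column label explicitly, so the sketch does not depend on the default name that a particular pandas version gives to the counts.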




In [16]:
df['drive-wheels'].value_counts() # to summarize

fwd    120
rwd     76
4wd      9
Name: drive-wheels, dtype: int64

In [22]:
# count the values in the drive-wheels column and store the summary
drive_wheels_counts = df['drive-wheels'].value_counts()
print(drive_wheels_counts)

fwd    120
rwd     76
4wd      9
Name: drive-wheels, dtype: int64


We see that we have 120 cars in the fwd (front
wheel drive) category, 76 cars in the rwd (rear wheel drive) category, and 9 cars in
the 4wd (four wheel drive) category.

In [None]:
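# Rename the Series so the counts are labeled 'value_counts'.
# rename() returns a new Series; combined with inplace=True it would return None.
drive_wheels_counts = drive_wheels_counts.rename('value_counts')

In [31]:
# Label the index with the original column name
drive_wheels_counts.index.name = 'drive-wheels'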
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


#### Descriptive Statistics - Box plots

Box plots are a great way to visualize numeric
data, since they show the distribution of the data at a glance.
The main features a box plot shows are the median, which marks where
the middle data point is, the upper quartile, which marks the 75th percentile,
and the lower quartile, which marks the 25th percentile. The range between the lower
and upper quartiles is the interquartile range (IQR).

Next come the lower and upper extremes. These lie 1.5 times the IQR
below the 25th percentile and 1.5 times the IQR above the 75th percentile.
In many plotting libraries the whiskers are drawn to the most extreme data points
that still fall inside these limits.

Finally, box plots display outliers as individual dots outside the lower
and upper extremes. With box plots, you can easily spot outliers
and also see the distribution and skewness of the data.
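
The quantities a box plot displays can be computed by hand. The following is a minimal sketch on made-up numbers (the `values` Series is an assumption, not data from the automobile dataset), using pandas' default quantile interpolation and the 1.5 × IQR rule.

```python
import pandas as pd

# Hypothetical values with one obviously large entry.
values = pd.Series([7, 8, 9, 10, 11, 12, 13, 14, 15, 40])

q1 = values.quantile(0.25)      # lower quartile (25th percentile)
median = values.quantile(0.50)  # middle of the data
q3 = values.quantile(0.75)      # upper quartile (75th percentile)
iqr = q3 - q1                   # interquartile range

# Limits at 1.5 * IQR beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are the ones a box plot draws as dots.
outliers = values[(values < lower_limit) | (values > upper_limit)]

# Whisker ends: the most extreme points that still lie inside the limits.
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"limits=({lower_limit}, {upper_limit})")
print(f"whiskers=({whisker_low}, {whisker_high})")
print("outliers:", outliers.tolist())
```

With matplotlib available, `plt.boxplot(values)` would draw the same summary, since its default whisker length is also 1.5 times the IQR.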