

# ANOVA test on Automobile DataSet



## Objectives

*   Determine whether the average price of cars differs significantly across the different types of drive wheels.


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import Data from Module</li>
    <li>Step by Step Analysis of statistics of ANOVA the Pythonic way</li>
    <li>Analysis of statistics of ANOVA using the Built in Function.</li>
</ol>

</div>

<hr>


<h3>What are the main characteristics that have the most impact on the car price?</h3>


<h2 id="import_data">1. Import Data </h2>


In this section, you will learn how to load a dataset into the Jupyter Notebook.<br>

In our case, the Automobile Dataset is hosted online in CSV (comma-separated values) format. Let's use this dataset as an example to practice data reading.

<ul>
    <li>Data source: <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01" target="_blank">https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data</a></li>
    <li>Data type: csv</li>
</ul>
<p>The Pandas library is a useful tool that enables us to read various datasets into a dataframe. Our Jupyter notebook platforms come with the <b>Pandas library</b> built in, so all we need to do is import Pandas without installing it.</p>


Import libraries:


In [43]:
import pandas as pd
import numpy as np
from scipy import stats

Load the data and store it in dataframe `df`:


In [6]:

df = pd.read_csv("https://raw.githubusercontent.com/Lakshmiholla-2808/ANOVA/main/automobileEDA.csv")
df.head()

Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,horsepower-binned,diesel,gas
0,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,9.0,111.0,5000.0,21,27,13495.0,11.190476,Medium,0,1
1,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,9.0,111.0,5000.0,21,27,16500.0,11.190476,Medium,0,1
2,1,122,alfa-romero,std,two,hatchback,rwd,front,94.5,0.822681,...,9.0,154.0,5000.0,19,26,16500.0,12.368421,Medium,0,1
3,2,164,audi,std,four,sedan,fwd,front,99.8,0.84863,...,10.0,102.0,5500.0,24,30,13950.0,9.791667,Medium,0,1
4,2,164,audi,std,four,sedan,4wd,front,99.4,0.84863,...,8.0,115.0,5500.0,18,22,17450.0,13.055556,Medium,0,1


## 2. Step by Step Analysis of statistics of ANOVA the Pythonic way

<h3>ANOVA: Analysis of Variance</h3>
<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>:  P-value tells how statistically significant our calculated score value is.</p>

<p>If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.</p>
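To make the F-test score concrete, here is a minimal sketch in plain Python of how it can be computed: the variation between group means, divided by the variation within the groups. The function name `one_way_f` and the toy samples are assumptions for illustration and are not part of the automobile data.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: group size times squared gap to the overall mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared gap of each value to its own group mean
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    # Ratio of the two mean squares
    return (ssb / (k - 1)) / (ssw / (n_total - k))


# Toy samples (illustrative values only)
f_score = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_score)
```

The third toy group sits well away from the other two while the spread inside each group is small, so the ratio comes out large.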


<h3>Value Counts</h3>


<p>Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "drive-wheels". Here we call "value_counts" on a pandas series, so we use one bracket <code>df['drive-wheels']</code>, not two brackets <code>df[['drive-wheels']]</code>, which would select a dataframe. (Since pandas 1.1, dataframes also have a "value_counts" method, but it counts unique rows rather than the values of a single column.)</p>


We can convert the series to a dataframe as follows:


In [79]:
df['drive-wheels'].value_counts().to_frame()

Unnamed: 0,drive-wheels
fwd,118
rwd,75
4wd,8


Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and rename the column  'drive-wheels' to 'value_counts'.


In [14]:
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts

Unnamed: 0,value_counts
fwd,118
rwd,75
4wd,8


Now let's rename the index to 'drive-wheels':


In [15]:
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts

drive-wheels,value_counts
fwd,118
rwd,75
4wd,8
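The column names produced by `value_counts().to_frame()` depend on the pandas version: from pandas 2.0 onward the counts column is called `count` rather than `drive-wheels`, so the `rename(columns=...)` step above would have nothing to rename. A small sketch of a variant that should behave the same on both old and new versions is shown below. The toy Series is an assumption standing in for `df['drive-wheels']`.

```python
import pandas as pd

# Toy stand-in for df['drive-wheels'] (illustrative values only)
drive_wheels = pd.Series(
    ['fwd', 'rwd', 'fwd', '4wd', 'fwd', 'rwd'], name='drive-wheels'
)

# Rename the counts Series before converting it, so the column label
# does not depend on what the pandas version calls it by default
counts = drive_wheels.value_counts().rename('value_counts').to_frame()

# Label the index explicitly (newer pandas already does this on its own)
counts.index.name = 'drive-wheels'
print(counts)
```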


###  Basics of Grouping


<p>The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.</p>

<p>For example, let's group by the variable "drive-wheels". We see that there are 3 different categories of drive wheels.</p>


In [83]:
df['drive-wheels'].unique()

array(['rwd', 'fwd', '4wd'], dtype=object)

<p>If we want to know, on average, which type of drive wheel is most valuable, we can group by "drive-wheels" and then average the prices within each group.</p>

<p>We can select the columns 'drive-wheels', 'body-style' and 'price', then assign it to the variable "df_group_one".</p>


In [84]:
df_group_one = df[['drive-wheels','body-style','price']]

We can then calculate the average price for each of the different categories of data.


In [85]:
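# grouping results
# numeric_only=True skips the text column 'body-style', which newer pandas versions refuse to average
df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean(numeric_only=True)
df_group_one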

Unnamed: 0,drive-wheels,price
0,4wd,10241.0
1,fwd,9244.779661
2,rwd,19757.613333


<p>From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.</p>

<p>You can also group by multiple variables. For example, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combination of 'drive-wheels' and 'body-style'. We can store the results in the variable 'grouped_test1'.</p>


In [86]:
# grouping results
df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1

Unnamed: 0,drive-wheels,body-style,price
0,4wd,hatchback,7603.0
1,4wd,sedan,12647.333333
2,4wd,wagon,9095.75
3,fwd,convertible,11595.0
4,fwd,hardtop,8249.0
5,fwd,hatchback,8396.387755
6,fwd,sedan,9811.8
7,fwd,wagon,9997.333333
8,rwd,convertible,23949.6
9,rwd,hardtop,24202.714286


<h3>Drive Wheels</h3>


<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see if different types of 'drive-wheels' impact  'price', we group the data.</p>


In [7]:
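# group by a single column name (not a one-element list) so that get_group('fwd') keeps working on newer pandas
grouped_test = df[['drive-wheels', 'price']].groupby('drive-wheels')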
grouped_test.head(2)

Unnamed: 0,drive-wheels,price
0,rwd,13495.0
1,rwd,16500.0
3,fwd,13950.0
4,4wd,17450.0
5,fwd,15250.0
136,4wd,7603.0


In [8]:
grouped_test.mean()

drive-wheels,price
4wd,10241.0
fwd,9244.779661
rwd,19757.613333


In [10]:
grouped_test['price'].mean().sum()/3

13081.130998116763

In [16]:
drive_wheels_counts

drive-wheels,value_counts
fwd,118
rwd,75
4wd,8


### Mean Square Variation between Groups

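Mean Square Variation between Groups=**SSB/(k-1)**

Where, **SSB=Sum, over all groups, of the group size multiplied by the squared difference between the group mean and the overall mean**   
**k= Number of groups which is 3**  

Note: with unequal group sizes, the overall mean is the mean of all 201 prices (`df['price'].mean()`). The calculation below uses 13081.13, the simple average of the three group means, which makes SSB slightly larger than it should be.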


In [22]:
8*pow((10241-13081.13), 2) +118*pow((9244.77-13081.13),2) +75*pow((19757.61-13081.13),2)

5144368246.468

In [18]:
5144368246.468/2

2572184123.234

### Calculating Mean square variation within groups

Mean Square Variation within Groups=**SSE/(N-k)**

Where, **SSE=Sum of the squared differences between each price and the mean price of its group, added up over all the groups**

**N=Total number of observations**
 
**k=Number of groups**

#### Creating 3 DataFrames for Each wheel type

In [21]:
# .copy() gives each subset its own data, so adding a column later does not raise SettingWithCopyWarning
rwd_df = df[df['drive-wheels'] == 'rwd'].copy()

fwd_df = df[df['drive-wheels'] == 'fwd'].copy()

fourwd_df = df[df['drive-wheels'] == '4wd'].copy()

In [23]:
rwd_df['SSE'] = pow((rwd_df['price'] - 19757.613333), 2)


In [24]:
rwd_df['SSE']

0      3.922033e+07
1      1.061204e+07
2      1.061204e+07
9      1.107301e+07
10     8.023698e+06
           ...     
196    8.483316e+06
197    5.078178e+05
198    2.983865e+06
199    7.357041e+06
200    8.221906e+06
Name: SSE, Length: 75, dtype: float64

In [25]:
rwd_df['SSE'].sum()

6104495457.786666

In [26]:
fwd_df['SSE']=pow((fwd_df['price']-9244.779661),2)


In [27]:
fwd_df['SSE']

3      2.213910e+07
5      3.606267e+07
6      7.165996e+07
7      9.360989e+07
8      2.140433e+08
           ...     
185    5.523536e+06
186    5.405489e+05
187    1.640428e+07
188    2.116203e+07
189    9.273367e+06
Name: SSE, Length: 118, dtype: float64

In [28]:
fwd_df['SSE'].sum()

1309819112.2711866

In [30]:
fourwd_df['SSE']=pow((fourwd_df['price']-10241.000000),2)


In [31]:
fourwd_df['SSE']

4      51969681.0
136     6959044.0
140     1016064.0
141     1036324.0
144     4963984.0
145     2111209.0
150     5489649.0
151     2140369.0
Name: SSE, dtype: float64

In [32]:
fourwd_df['SSE'].sum()

75686324.0

In [33]:
rwd_df['SSE'].sum()+fourwd_df['SSE'].sum()+fwd_df['SSE'].sum()

7490000894.057853
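The three per-group calculations above hard-code each group mean and repeat the same steps. As a sketch of an alternative (not the method this notebook follows), `groupby(...).transform('mean')` can attach each row's own group mean in one pass, so the within-group sum of squares comes from a single expression. The toy dataframe below is an assumption for illustration; only its column names mirror the automobile data.

```python
import pandas as pd

# Toy data (illustrative values only); column names mirror the notebook
toy = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'fwd', 'rwd', 'rwd', 'rwd', '4wd', '4wd', '4wd'],
    'price': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# For every row, the mean price of the group that row belongs to
group_mean = toy.groupby('drive-wheels')['price'].transform('mean')

# Within-group sum of squares, summed over all groups at once
sse = ((toy['price'] - group_mean) ** 2).sum()

# Mean square within groups: SSE / (N - k)
n_obs = len(toy)
k = toy['drive-wheels'].nunique()
msw = sse / (n_obs - k)
print(sse, msw)
```

Because the group means are computed from the data rather than typed in, this version also avoids rounding the means by hand.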

In [35]:
len(df)

201

In [36]:
7490000894.057853/(201-3)

37828287.34372653

In [37]:
2572184123.234/37828287.34372653

67.99631450035956
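The score above depends on which value is used as the overall mean. With unequal group sizes, the overall mean is the count-weighted average of the group means (equivalently, the mean of all 201 prices), not the simple average of the three group means. The sketch below redoes the arithmetic with the counts, group means and SSE reported in the outputs above; the variable names are assumptions for illustration.

```python
# Group sizes and group mean prices as reported in the outputs above
counts = {'4wd': 8, 'fwd': 118, 'rwd': 75}
means = {'4wd': 10241.0, 'fwd': 9244.779661, 'rwd': 19757.613333}
sse = 7490000894.057853  # total within-group sum of squares from above

n_total = sum(counts.values())  # 201 observations
k = len(counts)                 # 3 groups

# Simple average of the three group means (the value used earlier)
simple_avg = sum(means.values()) / k
# Count-weighted average, which equals the mean of all prices
grand_mean = sum(counts[g] * means[g] for g in counts) / n_total

# Between-group sum of squares around the weighted overall mean
ssb = sum(counts[g] * (means[g] - grand_mean) ** 2 for g in counts)

f_score = (ssb / (k - 1)) / (sse / (n_total - k))
print(simple_avg, grand_mean, f_score)
```

The weighted overall mean is pulled toward the front-wheel-drive mean because that group has the most cars, and the resulting F-score is slightly lower than the one computed with the simple average.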

## 3. Analysis of statistics of ANOVA using the Built in Function.

We can obtain the values of each group using the method "get_group".


We can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.


In [44]:
# ANOVA
f_val, p_val = stats.f_oneway(
    grouped_test.get_group('fwd')['price'],
    grouped_test.get_group('rwd')['price'],
    grouped_test.get_group('4wd')['price'],
)

print("ANOVA results: F=", f_val)

ANOVA results: F= 67.95406500780399
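The cell above prints only the F-test score, although `f_oneway` also returns the P-value. The sketch below, on toy data rather than the automobile dataset, shows one way to read both values without typing each group name, and cross-checks the P-value against the upper tail of the F distribution with (k-1, N-k) degrees of freedom. The toy dataframe and the variable names are assumptions for illustration.

```python
import pandas as pd
from scipy import stats

# Toy data (illustrative values only); column names mirror the notebook
toy = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'fwd', 'rwd', 'rwd', 'rwd', '4wd', '4wd', '4wd'],
    'price': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})
grouped = toy.groupby('drive-wheels')

# Pass every group's prices to f_oneway without hard-coding group names
f_val, p_val = stats.f_oneway(*(group['price'] for _, group in grouped))

# Cross-check: P-value as the upper tail of F with (k-1, N-k) degrees of freedom
k = grouped.ngroups
n_total = len(toy)
p_check = stats.f.sf(f_val, k - 1, n_total - k)

print("F =", f_val, "P =", p_val)
```

A small P-value alongside a large F-test score is what supports calling the difference in group means statistically significant.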


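An F-test score this large means the variation between the group mean prices is far greater than the variation within the groups, so the mean price differs markedly across drive-wheel types.

In our case I have considered the influence of only one factor, drive wheels, on the response variable price. So, it is a one-way ANOVA.

We find that the built-in function and the analysis which I did step by step give nearly the same F-test score (67.954 versus 67.996). The small gap arises because the step-by-step calculation used the simple average of the three group means (13081.13) as the overall mean and rounded the group means to two decimals; using the mean of all 201 prices instead brings the two values into agreement.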






## Author

<a href="https://www.linkedin.com/in/lakshmi-holla-b39062149/" target="_blank">Lakshmi Holla</a>


