**Univariate Analysis: Basic Python**

Use what you learned from class to complete the following tasks using Python code and your dataframe.

Read in the mpg.csv dataset. Remember to import the pandas package every time. Make sure the new dataframe is viewable in the output.

In [62]:
import pandas as pd

df = pd.read_csv('mpg.csv')
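The prompt asks for the dataframe to be visible in the output, which usually means printing it or ending the cell with `df.head()`. Here is a minimal sketch of that step, assuming a tiny in-memory CSV with made-up rows in place of mpg.csv so it runs without the file (`csv_text` and `demo_df` are hypothetical names):

```python
import io
import pandas as pd

# Hypothetical stand-in for mpg.csv so the sketch runs without the file.
csv_text = """MPG,Cylinders,Weight,Name
18.0,8,3504,car a
15.0,8,3693,car b
24.0,4,2372,car c
"""

demo_df = pd.read_csv(io.StringIO(csv_text))

# Print the first rows and the shape so the loaded data is visible in the output.
print(demo_df.head())
print(demo_df.shape)
```

With the real file, the same two `print` calls on `df` would show the data right after loading.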

Create a DataFrame that contains the following properties and statistics of all features in the MPG.csv dataset:

* Count
* Quantiles(.25, .5, .75)
* Mean
* Min
* Max
* Standard Deviation

In [63]:

numeric_df = df.select_dtypes(include=['float64', 'int64'])

count = numeric_df.count()
quantiles = numeric_df.quantile([0.25, 0.5, 0.75])
mean = numeric_df.mean()
min_values = numeric_df.min()
max_values = numeric_df.max()
std_dev = numeric_df.std()
summary_df = pd.DataFrame({
    'Count': count,
    '25th Percentile': quantiles.loc[0.25],
    '50th Percentile (Median)': quantiles.loc[0.5],
    '75th Percentile': quantiles.loc[0.75],
    'Mean': mean,
    'Min': min_values,
    'Max': max_values,
    'Std Dev': std_dev
})

print(summary_df)


              Count  25th Percentile  50th Percentile (Median)  \
MPG             392           17.000                     22.75   
Cylinders       392            4.000                      4.00   
Displacement    392          105.000                    151.00   
Horse_Power     392           75.000                     93.50   
Weight          392         2225.250                   2803.50   
Acceleration    392           13.775                     15.50   
Model_Year      392           73.000                     76.00   

              75th Percentile         Mean     Min     Max     Std Dev  
MPG                    29.000    23.445918     9.0    46.6    7.805007  
Cylinders               8.000     5.471939     3.0     8.0    1.705783  
Displacement          275.750   194.411990    68.0   455.0  104.644004  
Horse_Power           126.000   104.469388    46.0   230.0   38.491160  
Weight               3614.750  2977.584184  1613.0  5140.0  849.402560  
Acceleration           17.025    
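As a cross-check on the hand-built summary, pandas offers `describe()`, which returns count, mean, standard deviation, min, max, and the three quartiles in one call. This is a sketch on made-up data (`demo_df` and its values are hypothetical), not the notebook's actual output:

```python
import pandas as pd

# Hypothetical data standing in for a few MPG columns.
demo_df = pd.DataFrame({
    "MPG": [18.0, 15.0, 24.0, 31.0],
    "Weight": [3504, 3693, 2372, 1985],
    "Name": ["a", "b", "c", "d"],
})

# describe() skips non-numeric columns by default;
# transposing gives one row per variable, like the table above.
demo_summary = demo_df.describe().T
print(demo_summary)
```

The column labels differ from the hand-built table (`25%`, `50%`, `75%`, `std`), but the statistics are the same ones.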

Given the above table, do any of the variables seem to have missing values? (HINT: Check the Count and if 0 is located in any of the variables). Which variables seem to be normally distributed at a glance? (Check how far away the min and max are from the mean). Which variables are missing from your table? Answer this in the markdown section below.

Because every variable has a count of 392, none of them appear to have missing values. The variables that seem roughly normally distributed at a glance are MPG, Cylinders, Displacement, Horse_Power, Weight, and Acceleration. The only variable missing from my table is Name; because it is a string variable, it can be left out of the numeric statistics.
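The "at a glance" check can be made a little more concrete by comparing the mean to the median and the distance from the mean to each extreme. The sketch below uses two made-up columns (`symmetric` and `right_skewed` are hypothetical) and assumes only that a roughly symmetric variable has a mean close to its median:

```python
import pandas as pd

# Hypothetical columns: one symmetric, one with a long right tail.
demo = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 1, 2, 2, 3, 20],
})

# Side-by-side indicators of symmetry for each column.
glance = pd.DataFrame({
    "mean_minus_median": demo.mean() - demo.median(),
    "max_minus_mean": demo.max() - demo.mean(),
    "mean_minus_min": demo.mean() - demo.min(),
    "skew": demo.skew(),
})
print(glance)
```

A mean well above the median, or a max much farther from the mean than the min is, points to a right tail rather than a bell shape. This is a rough screen, not a formal normality test.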

From the data in this new DataFrame, what is the skewness of MPG? 

In [81]:
mpg_skewness = df['MPG'].skew()

print("Skewness of MPG:", mpg_skewness)
#0.45709232306041025

Skewness of MPG: 0.45709232306041025
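A positive value such as this one indicates a longer right tail. To see where the number comes from, the sketch below recomputes skewness by hand on a made-up series (`values` is hypothetical), assuming `Series.skew()` applies the bias-adjusted sample formula, which the final comparison checks:

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from mpg.csv.
values = pd.Series([9.0, 14.0, 18.0, 22.0, 23.0, 29.0, 46.6])

n = len(values)
deviations = values - values.mean()

# Second and third central moments (divided by n).
m2 = (deviations ** 2).mean()
m3 = (deviations ** 3).mean()

# Biased skewness, then the small-sample adjustment.
g1_biased = m3 / m2 ** 1.5
g1_adjusted = np.sqrt(n * (n - 1)) / (n - 2) * g1_biased

print("Manual skewness:", g1_adjusted)
print("pandas skewness:", values.skew())
```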


What is the standard deviation of the weight from the mpg.csv dataset? 

In [65]:
weight_std_dev = df['Weight'].std()

print("Standard Deviation of Weight:", weight_std_dev)
#849.4025600429492

Standard Deviation of Weight: 849.4025600429492
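Note that `Series.std()` defaults to the sample standard deviation (`ddof=1`), while NumPy's `np.std` defaults to the population version (`ddof=0`), so the two can disagree slightly on the same data. A small sketch with made-up weights (`weights` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical weights, not taken from mpg.csv.
weights = pd.Series([3504, 3693, 2372, 1985])

sample_std = weights.std()               # pandas default: ddof=1 (divide by n - 1)
population_std = weights.std(ddof=0)     # divide by n
numpy_std = np.std(weights.to_numpy())   # NumPy default: ddof=0

print("Sample std:", sample_std)
print("Population std:", population_std)
print("NumPy std:", numpy_std)
```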


What is the mean weight for the mpg.csv dataset?

In [66]:
mean_weight = df['Weight'].mean()

print("Mean Weight:", mean_weight)
#2977.5841836734694

Mean Weight: 2977.5841836734694


*How many* unique values are there for Name?

In [67]:
unique_names_count = df['Name'].nunique()

print("Number of unique values for Name:", unique_names_count)
#301

Number of unique values for Name: 301


From this number alone, can you tell if any of the names are repeated in the dataset? Show in the code.

In [68]:
total_names_count = df['Name'].count()
unique_names_count = df['Name'].nunique()
repeated_names_exist = total_names_count > unique_names_count

print("Total number of entries for Name:", total_names_count)
print("Number of unique values for Name:", unique_names_count)
print("Are there any repeated names?", repeated_names_exist)

Total number of entries for Name: 392
Number of unique values for Name: 301
Are there any repeated names? True
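Comparing the two counts shows that repeats exist, but not which names repeat. One way to list them is `value_counts()`, sketched here on a made-up series (the entries in `names` are hypothetical rows, not a sample of the dataset):

```python
import pandas as pd

# Hypothetical Name column with some repeats.
names = pd.Series(
    ["ford pinto", "amc matador", "ford pinto",
     "toyota corolla", "ford pinto", "amc matador"],
    name="Name",
)

# Count occurrences of each name and keep those seen more than once.
name_counts = names.value_counts()
repeated = name_counts[name_counts > 1]
print(repeated)

# duplicated() flags every occurrence after the first; any() answers yes/no.
has_repeats = names.duplicated().any()
print("Are there any repeated names?", has_repeats)
```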


*What are* the unique values for Cylinders?

In [69]:
unique_cylinders = df['Cylinders'].unique()

print("Unique values for Cylinders:", unique_cylinders)
#[8 4 6 3 5]

Unique values for Cylinders: [8 4 6 3 5]


What is the mode of Displacement?

In [70]:
displacement_mode = df['Displacement'].mode()
print("Mode of Displacement:", displacement_mode)
#97.0


Mode of Displacement: 0    97.0
Name: Displacement, dtype: float64
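The output above includes an index and dtype line because `mode()` returns a Series, not a single number: a column can have several equally common values. A sketch of pulling out a scalar, using made-up data with a deliberate tie (`displacement` is hypothetical):

```python
import pandas as pd

# Hypothetical Displacement values with a tie between 97.0 and 250.0.
displacement = pd.Series([97.0, 97.0, 350.0, 250.0, 250.0, 140.0], name="Displacement")

# mode() returns all tied modes, sorted in ascending order.
modes = displacement.mode()
print("All modes:", modes.tolist())

# Take the first (smallest) mode when a single value is wanted.
first_mode = modes.iloc[0]
print("First mode:", first_mode)
```

On the real column, `df['Displacement'].mode().iloc[0]` would print just the value.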


Show the types of each of the variables (fields/columns). Do they seem fine or do any seem different than expected?

In [71]:
print(df.dtypes)

MPG             float64
Cylinders         int64
Displacement    float64
Horse_Power       int64
Weight            int64
Acceleration    float64
Model_Year        int64
Name             object
dtype: object


The data types seem fine. It makes sense that MPG, Displacement, and Acceleration are floats, considering they need decimal precision. I might have expected Horse_Power and Weight to be floats as well rather than integers. Name being an object also makes sense, since it holds strings.

Check for missing values (NA). Are there any?

In [72]:
missing_values = df.isna().sum()

print("Missing values in each column:")
print(missing_values)

Missing values in each column:
MPG             0
Cylinders       0
Displacement    0
Horse_Power     0
Weight          0
Acceleration    0
Model_Year      0
Name            0
dtype: int64


It does not appear that there are any missing values within the dataframe.
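This agrees with the earlier Count check: `count()` only counts non-missing entries, so a count below the number of rows signals missing values. A sketch on a made-up frame that does contain gaps (`demo_df` is hypothetical) shows how the two views line up:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
demo_df = pd.DataFrame({
    "MPG": [18.0, np.nan, 24.0],
    "Name": ["a", "b", None],
})

# len() counts rows; count() counts non-missing entries per column.
print("Rows:", len(demo_df))
print(demo_df.count())

# isna().sum() gives the per-column gap; any().any() collapses it to one flag.
missing = demo_df.isna().sum()
any_missing = demo_df.isna().any().any()
print(missing)
print("Any missing values?", any_missing)
```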

**BONUS** *(2 pts)*: Create another DataFrame that contains the following properties and statistics of all features in the MPG.csv dataset:

* Count
* Data type
* Number unique
* Number missing
* Quantiles(.25, .5, .75)
* Mean
* Median
* Mode
* Min
* Max
* Standard Deviation
* Kurtosis
* Skewness

In [79]:

numeric_df = df.select_dtypes(include=['float64', 'int64'])

summary_df = pd.DataFrame()

summary_df['Count'] = numeric_df.count()
summary_df['Data Type'] = numeric_df.dtypes
summary_df['Number Unique'] = numeric_df.nunique()
summary_df['Number Missing'] = numeric_df.isna().sum()

quantiles = numeric_df.quantile([0.25, 0.5, 0.75])
summary_df['25th Percentile'] = quantiles.loc[0.25]
summary_df['50th Percentile (Median)'] = quantiles.loc[0.5]
summary_df['75th Percentile'] = quantiles.loc[0.75]
summary_df['Mean'] = numeric_df.mean()
summary_df['Median'] = numeric_df.median()
summary_df['Mode'] = numeric_df.mode().iloc[0] 
summary_df['Min'] = numeric_df.min()
summary_df['Max'] = numeric_df.max()


summary_df['Standard Deviation'] = numeric_df.std()
summary_df['Kurtosis'] = numeric_df.kurtosis()
summary_df['Skewness'] = numeric_df.skew()

print(summary_df)


              Count Data Type  Number Unique  Number Missing  25th Percentile  \
MPG             392   float64            127               0           17.000   
Cylinders       392     int64              5               0            4.000   
Displacement    392   float64             81               0          105.000   
Horse_Power     392     int64             93               0           75.000   
Weight          392     int64            346               0         2225.250   
Acceleration    392   float64             95               0           13.775   
Model_Year      392     int64             13               0           73.000   

              50th Percentile (Median)  75th Percentile         Mean   Median  \
MPG                              22.75           29.000    23.445918    22.75   
Cylinders                         4.00            8.000     5.471939     4.00   
Displacement                    151.00          275.750   194.411990   151.00   
Horse_Power                