# Project Python Foundations: Austo

**Marks: 60 points**

## Problem Statement

### Context

In the 21st century, cars are an important mode of transportation that gives people personal control and autonomy. In day-to-day life, people use cars for commuting to work, shopping, visiting family and friends, etc. Research shows that more than 76% of people avoid traveling somewhere if they don't have a car. Most people buy different types of cars based on their day-to-day necessities and preferences. So, it is essential for automobile companies to analyze the preferences of their customers before launching a car model into the market. Austo, a UK-based automobile company, aspires to grow its business into the US market after successfully establishing its footprint in the European market.

To become familiar with the types of cars customers prefer and the factors influencing car purchase behavior in the US market, Austo has contracted a consulting firm. Based on various market surveys, the consulting firm has created a dataset of 3 major types of cars that are extensively used across the US market. They have collected various details of the car owners, which can be analyzed to understand the US automobile market.


### Objective

Austo's management team wants to understand buyer demand and trends in the US market. They want to build customer profiles based on the analysis to identify new purchase opportunities so that they can adapt the business strategy and production to meet certain demand levels. Further, the analysis will be a good way for management to understand the dynamics of a new market. Suppose you are a Data Scientist working at the consulting firm that has been contracted by Austo. You are given the task of creating buyer profiles for different types of cars with the available data, as well as a set of recommendations for Austo. Perform the data analysis to generate useful insights that will help the automobile company grow its business.

### Data Description

austo_automobile.csv: The dataset contains buyers' data corresponding to different types of products (cars).

### Data Dictionary

* Age: Age of the customer
* Gender: Gender of the customer
* Profession: Indicates whether the customer is a salaried or business person
* Marital_status: Marital status of the customer
* Education: Refers to the highest level of education completed by the customer
* No_of_dependents: Number of dependents (partner/children/spouse) of the customer
* Personal_loan: Indicates whether the customer availed a personal loan or not
* House_loan: Indicates whether the customer availed house loan or not
* Partner_working: Indicates whether the customer's partner is working or not
* Salary: Annual Salary of the customer
* Partner_salary: Annual Salary of the customer's partner
* Total_salary: Annual household income (Salary + Partner_salary) of the customer's family
* Price: Price of the car
* Make: Car type (Hatchback/Sedan/SUV)
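
The dictionary defines `Total_salary` as `Salary + Partner_salary`, so that identity can be verified before any analysis relies on it. The following is a minimal sketch on a toy frame with made-up values (not the Austo data); treating a missing partner salary as 0 is an assumption made only for this illustration.

```python
import pandas as pd

# Toy rows (hypothetical values) using the column names from the data dictionary
toy = pd.DataFrame({
    "Salary": [60000, 45000, 80000],
    "Partner_salary": [20000, 0, None],   # None becomes NaN: partner salary unknown
    "Total_salary": [80000, 45000, 80000],
})

# Recompute the household income, treating a missing partner salary as 0 (assumption)
expected_total = toy["Salary"] + toy["Partner_salary"].fillna(0)

# Rows where the stored total disagrees with the recomputed one
mismatch = toy[toy["Total_salary"] != expected_total]
print(len(mismatch))
```

If the same check on the real data returns rows, those rows deserve a closer look before `Total_salary` is used in plots or group summaries.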

### **Please read the instructions carefully before starting the project.**
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned. Read along carefully to complete the project.
* Blanks '_______' are provided in the notebook that need to be filled with appropriate code to get the correct result. Please replace each blank with the right code snippet. With every '_______' blank, there is a comment that briefly describes what needs to be filled in.
* Identify the task to be performed correctly, and only then proceed to write the required code.
* Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
* Please run the code cells sequentially from the beginning to avoid unnecessary errors.
* You can record the results/observations derived from the analysis here and use them to create your final presentation.


## Importing necessary libraries

In [None]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 -q --user

**Note**: *After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.*

In [None]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset

In [None]:
# uncomment and run the following lines for Google Colab
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# read the data
df = pd.read_csv('_______') ## Complete the code to read the data
# returns the first 5 rows
df.head()

## Data Overview

### Question 1: How many rows and columns are present in the data? [0.5 mark]

In [None]:
# Check the shape of the dataset
df._______ ## Complete the code to get the shape of the dataset

### Question 2: What are the datatypes of the different columns in the dataset? [0.5 mark]

In [None]:
df.info()

### Question 3: Check the statistical summary of the data. List your observations for each column. [2 marks]

In [None]:
# Get the summary statistics of the numerical data
df.'_______' ## Complete the code to print the statistical summary of the data (Hint - you have seen this in the case studies before)

### Question 4: Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]

In [None]:
# Checking for missing values in the data
df.'______'  #Write the appropriate function to print the sum of null values for each column
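
As a rough illustration of what "treat them using an appropriate method" can look like, the sketch below counts missing values and then imputes them on a toy frame. The values are made up, and median/mode imputation is only one possible choice, not necessarily the right treatment for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; not the Austo data
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Partner_salary": [20000.0, np.nan, 0.0, 40000.0],
})

# Count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
toy["Partner_salary"] = toy["Partner_salary"].fillna(toy["Partner_salary"].median())

# Categorical column: fill with the most frequent category
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])
```

When a missing value can be derived from other columns (for example through a documented relationship between them), deriving it is usually preferable to a statistical fill.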

### Question 5: How many cars are there of type SUV? [1 mark]

In [None]:
df['______'].value_counts() ## Complete the Code to list the counts of SUV cars

## Exploratory Data Analysis (EDA)

### Univariate Analysis

#### **Question 6:** Explore all the variables and provide observations on the distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [10 marks]

##### Observations on Age

In [None]:
## Histogram and boxplot for Age
sns.histplot(data=df,x='Age')
plt.show()
sns.boxplot(data=df,x='Age')
plt.show()

##### Gender

In [None]:
# Check Gender type
df['Gender'].'_______' ## Complete the code to find out unique Gender type

In [None]:
sns.countplot(data=df,x='Gender') ## Complete the code to plot 'Gender' column
plt.show()
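
Checking the unique values of a categorical column is also a quick way to spot inconsistent labels. The sketch below uses hypothetical labels with spelling variants (not taken from the Austo data) and shows one way to map them onto a single category.

```python
import pandas as pd

# Hypothetical labels with spelling variants; not taken from the Austo data
toy = pd.DataFrame({"Gender": ["Male", "Female", "Femal", "Male", "Femle"]})

print(toy["Gender"].unique())        # every distinct label, including variants
print(toy["Gender"].value_counts())  # how often each label occurs

# Map each variant onto the intended category
toy["Gender"] = toy["Gender"].replace({"Femal": "Female", "Femle": "Female"})
```

Cleaning labels before plotting keeps the countplot from showing the same category as several separate bars.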

##### Profession

In [None]:
# Check Profession type
df['Profession'].'_______' ## Complete the code to find out unique Profession type

In [None]:
sns.countplot(data=df,x='______') ## Complete the code to plot  'Profession' column
plt.show()

##### Marital status

In [None]:
# Check the unique values
df['Marital_status'].'_______' ## Complete the code to find out unique Marital_status type

In [None]:
sns.countplot(data=df,x='______') ## Complete the code to plot 'Marital_status' column
plt.show()

##### Education

In [None]:
# Check Education type
df['Education'].'_______' ## Complete the code to find out unique Education type

In [None]:
sns.countplot(data=df,x='______') ## Complete the code to plot 'Education' column
plt.show()

##### Number of dependents

In [None]:
# Check the unique values
df['No_of_Dependents'].'_______' ## Complete the code to find out the unique values of No_of_Dependents

In [None]:
sns.countplot(data=df,x='______') ## Complete the code to plot for 'No_of_Dependents' column
plt.show()

##### Personal loan

In [None]:
# Check the unique values
df['Personal_loan'].'_______' ## Complete the code to find out unique Personal_loan type

In [None]:
sns.countplot(data=df,x='______') ## Complete the code to plot 'Personal_loan' column
plt.show()

##### House loan

In [None]:
# Check the unique values
df['House_loan'].'_______' ## Complete the code to check unique values for the 'House_loan' column

In [None]:
sns.countplot(data=df,x='______') ## Complete the code to plot 'House_loan' column
plt.show()

##### Working status of customer's partner

In [None]:
# Check the unique values
df['Partner_working'].'_______' ## Complete the code to check unique values for the 'Partner_working' column

In [None]:
sns.countplot(data=df,x='______') ## Complete the code to plot 'Partner_working' column
plt.show()

##### Observations on Salary

In [None]:
## Complete the code to plot histogram and boxplot for 'Salary' column
sns.histplot(data=df,x='_____')
plt.show()
sns.boxplot(data=df,x='_____')
plt.show()

##### Observations on Partner's salary

In [None]:
## Complete the code to plot histogram and boxplot for 'Partner_salary' column
sns.histplot(data=df,x='_____')
plt.show()
sns.boxplot(data=df,x='_____')
plt.show()

##### Observations on Total salary

In [None]:
## Complete the code to plot histogram and boxplot for 'Total_salary' column
sns.histplot(data=df,x='_____')
plt.show()
sns.boxplot(data=df,x='_____')
plt.show()

##### Observations on Price

In [None]:
## Complete the code to plot histogram and boxplot for 'Price' column
sns.histplot(data=df,x='_____')
plt.show()
sns.boxplot(data=df,x='_____')
plt.show()

##### Make

In [None]:
# Check the unique values
df['Make'].'_______' ## Complete the code to check unique values for the 'Make' column

In [None]:
sns.countplot(data=df,x='______') ## Complete the code to plot 'Make' column
plt.show()

#### Question 7: How many cars are of make Hatchback and priced above 25000? State your observations. [2 marks]

In [None]:
# Get the cars whose make is hatchback.
df_hatchback = df[df['_______'] == '________']

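# Count the hatchbacks priced above 25000.
# The mask is built from df_hatchback itself so that its index matches the frame being filtered.
df_hatchback[df_hatchback['_______'] > 25000].shape[0]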
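
An equivalent way to count rows that satisfy two conditions is to combine both conditions into a single boolean mask. The sketch below uses a toy frame with made-up prices; only the column names come from the data dictionary.

```python
import pandas as pd

# Hypothetical rows; column names follow the data dictionary
toy = pd.DataFrame({
    "Make": ["Hatchback", "SUV", "Hatchback", "Sedan", "Hatchback"],
    "Price": [22000, 61000, 31000, 38000, 27000],
})

# Build both conditions on the same frame, then combine with & (note the parentheses)
mask = (toy["Make"] == "Hatchback") & (toy["Price"] > 25000)

count = int(mask.sum())   # True counts as 1, so the sum is the number of matching rows
print(count)
```

Building every condition on the same DataFrame avoids the index-alignment warning that pandas raises when a mask from one frame is applied to a filtered copy of it.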

#### Question 8: How many owners have bought cars that were priced higher than their salary? How many of them have taken a personal loan? [3 marks]

In [None]:
# Get the owners who have bought cars priced higher than their salary
df_pricing = df[df['_______'] > df['_______']]

print('Number of owners who have purchased cars priced higher than their salary:',df_pricing.shape[0])

# Among those owners, get the ones who have also taken a personal loan.
# The mask is built from df_pricing itself so that its index matches the frame being filtered.
df_personal_loan = df_pricing[df_pricing['________'] == 'Yes']
print('Number of owners who have taken a personal loan and have cars priced higher than their salary:', df_personal_loan.shape[0])

### Multivariate Analysis

#### Question 9: Perform bivariate/multivariate analysis to explore relationships between the important variables in the dataset. [15 marks]

In [None]:
# Plot the heatmap to find the correlation between numerical variables
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(numeric_only = True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
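
Alongside the heatmap, it can help to list the variable pairs ranked by correlation strength. The following is a sketch on a toy frame with made-up numbers; only the column names and the `Total_salary` definition come from the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns; Total_salary is built as Salary + Partner_salary
toy = pd.DataFrame({
    "Age": [25, 32, 41, 50, 29],
    "Salary": [50000, 62000, 78000, 90000, 55000],
    "Partner_salary": [0, 30000, 25000, 40000, 20000],
})
toy["Total_salary"] = toy["Salary"] + toy["Partner_salary"]

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once, then rank by |r|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.head())
```

Columns that are defined in terms of each other, such as a total and its components, will correlate strongly by construction, so such pairs say little about buyer behavior.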

In [None]:
# pairplot to show the relationship between numerical variables, with hue set to Make
sns.pairplot(data=________, hue="Make", diag_kind="kde")
plt.show()

##### Make vs Age

In [None]:
# boxplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.boxplot(data=df, x="________", y="________", palette="PuBu")  ## Complete the code to visualize the relationship between Make and Age using boxplot
plt.show()

##### Make vs Price

In [None]:
# boxplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.boxplot('________')  ## Complete the code to visualize the relationship between make and price using boxplot
plt.show()

##### Make vs Salary

In [None]:
# boxplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.boxplot(data=df, x="_______", y="_______", palette="PuBu") ## Complete the code to visualize the relationship between make and salary using boxplot
plt.show()

##### Make vs Education

In [None]:
# countplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.countplot(data=df, x="_______", hue="_______")  ## Complete the code to visualize the relationship between make and education using countplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

##### Make vs No_of_Dependents

In [None]:
# countplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.countplot(data=df, x="_______", hue="_______")  ## Complete the code to visualize the relationship between make and no of dependents using countplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

##### Make vs Profession

In [None]:
# countplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.countplot(data=df, x="_______", hue="_______")  ## Complete the code to visualize the relationship between make and profession using countplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

##### Make vs Personal loan

In [None]:
# countplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.countplot(data=df, x="______", hue="______")  ## Complete the code to visualize the relationship between make and personal loan using countplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

##### Make vs House loan

In [None]:
# countplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.countplot(data=df, x="______", hue="______")    ## Complete the code to visualize the relationship between make and House Loan using countplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

##### Grouping data by car type to build customer profiles

###### Customer profile for Hatchback

In [None]:
df[df["Make"]=="_______"].describe(include="all") # Filter the data for Hatchback and summarize it to build the customer profile

###### Customer profile for Sedan

In [None]:
df[df["Make"]=="_______"].describe(include="all") # Filter the data for Sedan and summarize it to build the customer profile

###### Customer profile for SUV

In [None]:
df[df["Make"]=="_______"].describe(include="all") # Filter the data for SUV and summarize it to build the customer profile
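
The three per-type summaries can also be condensed into one side-by-side table with `groupby` and named aggregation. This is an optional sketch on made-up rows; the aggregate names (`median_age`, `mean_price`, `top_gender`, `buyers`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow the data dictionary
toy = pd.DataFrame({
    "Make": ["Hatchback", "Sedan", "SUV", "Hatchback", "SUV", "Sedan"],
    "Age": [24, 33, 52, 27, 48, 36],
    "Price": [22000, 35000, 60000, 26000, 56000, 37000],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
})

# One row per car type: typical age and price, the most common gender, and the group size
profiles = toy.groupby("Make").agg(
    median_age=("Age", "median"),
    mean_price=("Price", "mean"),
    top_gender=("Gender", lambda s: s.mode().iloc[0]),  # ties resolve alphabetically
    buyers=("Age", "size"),
)
print(profiles)
```

A single table like this makes differences between car types easier to compare than three separate `describe` outputs, at the cost of showing fewer statistics per column.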

##### Customer Segmentation

###### Profession vs Price

In [None]:
# boxplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.boxplot('________')  ## Complete the code to visualize the relationship between profession and price using boxplot
plt.show()



###### Education vs Price vs Make

In [None]:
# boxplot to show relationship between three variables
plt.figure(figsize=(15,7))
sns.boxplot('________')  ## Complete the code to visualize the relationship between Education, Price and Make using boxplot
plt.show()

###### Education vs Price

In [None]:
# boxplot to show relationship between two variables
plt.figure(figsize=(15,7))
sns.boxplot('________')  ## Complete the code to visualize the relationship between Education and Price  using boxplot
plt.show()

###### Profession vs Price vs Education

In [None]:
# boxplot to show relationship between three variables
plt.figure(figsize=(15,7))
sns.boxplot('________')  ## Complete the code to visualize the relationship between Profession, Price and Education using boxplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

###### Gender vs Price vs Education

In [None]:
# boxplot to show relationship between three variables
plt.figure(figsize=(15,7))
sns.boxplot('________')  ## Complete the code to visualize the relationship between Gender, Price and Education using boxplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

###### Number of Dependents vs Price vs Profession

In [None]:
# pointplot to show relationship between three variables
plt.figure(figsize=(15,7))
sns.pointplot('___________',hue="________")  ## Complete the code to visualize the relationship between Number of Dependents, Price and Profession using pointplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

###### Age vs Price vs Make

In [None]:
# lineplot to show relationship between three variables
# (errorbar=None replaces the deprecated ci=0 and draws the lines without confidence bands)
plt.figure(figsize=(15,7))
sns.lineplot('_______', hue="Make", errorbar=None) ## Complete the code to visualize the relationship between Age, Price and Make using lineplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

###### Age vs Price vs House Loan

In [None]:
# lineplot to show relationship between three variables
# (errorbar=None replaces the deprecated ci=0 and draws the lines without confidence bands)
plt.figure(figsize=(15,7))
sns.lineplot('___________', hue="House_loan", errorbar=None) ## Complete the code to visualize the relationship between Age, Price and House Loan using lineplot
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

#### Question 10: For customers who have 3 or fewer dependents, how does the average car price vary by profession? [2 marks]

In [None]:
df_dependents = df[df["___________"]<= 3] # Get the rows where the number of dependents is less than or equal to 3
df_profession = df_dependents.groupby(['Profession'])['Price'].mean().sort_values(ascending = False).reset_index()
df_profession

#### Question 11: For customers who have availed a home loan and a personal loan, how does the price vary by profession? [3 marks]

In [None]:
# Filter the data for customers who have both a house loan and a personal loan
# (stored in a new variable, df_loans, so that the original df is not overwritten)
df_loans = df[(df['Personal_loan'] == '_______') & (df['House_loan'] == '_______')]

plt.figure(figsize=(15,7))
sns.boxplot('____________')   ## Complete the code to visualize the relationship between Price and Profession using boxplot on df_loans
plt.show()

## Conclusion and Recommendations

#### **Question 12:** Write the conclusions and business recommendations derived from the analysis. [6 marks]