## Lab Session 

### Learning Objectives:
- Working with data using Python libraries.
- Data visualization.
- Exploratory data analysis and data preprocessing.
- Building a linear regression model to predict the tip amount based on different input features.

### About the dataset (Customer Tip Data)

#### Dataset Source: https://www.kaggle.com/datasets/ranjeetjain3/seaborn-tips-dataset

The dataset contains information about the 244 orders served at a restaurant in the United States. Each observation includes the factors related to the order like total bill, time, the total number of people in a group, gender of the person paying for the order and so on.

#### Attribute Information:

- **total_bill:** Total bill (cost of the meal), including tax, in US dollars
- **tip:** Tip in US dollars
- **sex:** Sex of person paying for the meal
- **smoker:** Whether there is a smoker in the group
- **day:** Day on which the order is served
- **time:** Time of the order
- **size:** Size of the group

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.

### Import required libraries

In [2]:
import pandas as pd

### Load the dataset

In [None]:
data = pd.read_csv('tips.csv')

In [3]:
data.head()


### 1. Make a list of categorical and numerical columns in the data.

In [None]:
data.dtypes

In [None]:
data.select_dtypes(['float64' , 'int64']).columns

In [None]:
data.select_dtypes(['object']).columns

### 2. Compute the average bill amount for each day.

In [None]:
data.groupby('day')['total_bill'].mean()

### 3. Which gender is more generous in giving tips?

In [None]:
data.groupby('sex')['tip'].mean()  # average tip per order; a sum would favor whichever group paid more bills
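
A raw tip amount also depends on how large the bill was, so one alternative reading of "generous" is the tip as a percentage of the bill. The sketch below illustrates that idea on a small made-up frame: `demo` and the `tip_pct` column are hypothetical names, and the numbers are not taken from tips.csv.

```python
import pandas as pd

# Hypothetical rows for illustration only (not real tips.csv values).
demo = pd.DataFrame({
    'total_bill': [10.0, 20.0, 40.0, 25.0],
    'tip': [2.0, 3.0, 4.0, 5.0],
    'sex': ['Female', 'Male', 'Male', 'Female'],
})

# Tip expressed as a percentage of the bill for each order.
demo['tip_pct'] = demo['tip'] / demo['total_bill'] * 100

# Average tip percentage per group.
tip_pct_by_sex = demo.groupby('sex')['tip_pct'].mean()
print(tip_pct_by_sex)
```

The same two lines can be applied to `data` once the dataset is loaded.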

### 4. According to the data, were there more customers for dinner or lunch?

In [None]:
data

In [None]:
data.groupby('time')['size'].sum()

### 5. Based on the statistical summary, comment on the variable 'tip'

In [None]:
data.describe()

### 6. Find the busiest day in terms of the number of orders.

In [None]:
data.groupby('day')['total_bill'].count().sort_values(ascending=False)  # each row is one order, so count rows per day

### 7. Is the variable 'total_bill' skewed? If yes, identify the type of skewness. Support your answer with a plot

In [None]:
data['total_bill'].skew()

In [None]:
import seaborn as sns

In [None]:
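sns.histplot(data['total_bill'],kde=True)  # distplot is deprecated; histplot with kde=True gives a comparable histogram and density curve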
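
The sign of `Series.skew()` indicates the direction of the skew: a positive value means a longer right tail, a negative value a longer left tail. A minimal sketch on made-up numbers (the `sample` series is hypothetical and not taken from tips.csv):

```python
import pandas as pd

# Hypothetical values with one large bill on the right, not taken from tips.csv.
sample = pd.Series([8.0, 10.0, 12.0, 13.0, 15.0, 16.0, 48.0])

skewness = sample.skew()

# Translate the sign of the skewness into a description.
if skewness > 0:
    direction = 'right (positive) skew'
elif skewness < 0:
    direction = 'left (negative) skew'
else:
    direction = 'symmetric'
print(direction)
```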

### 8. Is the tip amount dependent on the total bill? Visualize the relationship with an appropriate plot and metric and write your findings.

In [None]:
data[['tip','total_bill']].corr()

In [None]:
sns.scatterplot(x='total_bill',y='tip',data=data)

### 9. What is the percentage of males and females in the dataset? Display it in a plot.

In [None]:
sns.countplot(x='sex',data=data)

In [None]:
data['sex'].value_counts(normalize=True)

### 10. Compute the gender-wise count based on smoking habits and display it in a plot.

In [None]:
data.columns

In [None]:
pd.crosstab(data['sex'],data['smoker']).plot(kind='bar')

### 11. Compute the average tip amount given on different days and display it in a plot.

In [None]:
data.groupby('day')['tip'].mean()

In [None]:
import numpy as np

In [None]:
sns.barplot(x='day',y='tip',data=data,estimator=np.mean,errorbar=None)  # errorbar=None needs seaborn >= 0.12; use ci=None on older versions

### 12. Is the average bill amount dependent on the size of the group? Visualize the relationship using an appropriate plot and write your findings.

In [None]:
sns.barplot(x='size',y='total_bill',estimator=np.mean,data=data)  # default error bars show the 95% confidence interval

### 13. Plot a horizontal boxplot to compare the bill amount based on gender

In [None]:
sns.boxplot(x='total_bill',y='sex',data=data,orient='h')  # numeric variable on x gives horizontal boxes

### 14. Find the maximum bill amount for lunch and dinner on Saturday and Sunday

In [None]:
data[data['day'].isin(['Sat','Sun'])].groupby(['day','time'])['total_bill'].max()  # keep weekend rows, then group by day and time

### 15. Compute the percentage of missing values in the dataset.

In [None]:
(data.isnull().sum()/len(data))*100

### 16. Are there any duplicate records in the dataset? If yes, compute the count of the duplicate records and drop them.

In [None]:
data[data.duplicated()]

In [None]:
data.duplicated().sum()  # count of duplicate records

In [None]:
data_n = data.drop_duplicates()  # drop_duplicates returns a new dataframe, so keep the result

In [None]:
data_n

### 17. Are there any outliers present in the column 'total_bill'? If yes, treat them with a transformation approach, and plot a boxplot before and after the treatment.

In [None]:
sns.boxplot(x=data['total_bill'])  # before the treatment

In [None]:
sns.boxplot(x=np.log(data['total_bill']))  # after the log transformation

### 18. Are there any outliers present in the column 'tip'? If yes, remove them using the IQR technique.

In [None]:
sns.boxplot(x=data['tip'])

In [None]:
q1 = data['tip'].quantile(0.25)
q2 = data['tip'].quantile(0.5)
q3 = data['tip'].quantile(0.75)

In [None]:
iqr=q3-q1

In [None]:
upper_limit = q3+1.5*(iqr)

In [None]:
upper_limit

In [None]:
lower_limit = q1-1.5*(iqr)

In [None]:
lower_limit

In [None]:
data

In [None]:
data_wo_out = data.loc[(data['tip'] >= lower_limit) & (data['tip'] <= upper_limit)]  # keep rows inside the IQR fences

In [None]:
data_wo_out

In [None]:
sns.boxplot(x=data_wo_out['tip'])

### 19. Encode the categorical columns in the dataset and print 5 random samples from the dataframe.

In [None]:
data.dtypes

In [None]:
cat_data = data[['sex','smoker','day','time']]

In [None]:
cat_data = pd.get_dummies(cat_data,drop_first=True)

In [None]:
cat_data.sample(5)  # 5 random rows of the encoded columns

### 20. Check the range of the column 'total_bill' and transform the values such that the range will be 1.

In [None]:
data['total_bill'].min()

In [None]:
data['total_bill'].max()

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
mm = MinMaxScaler()

In [None]:
data['total_bill'] = mm.fit_transform(data[['total_bill']])

In [None]:
data['total_bill'].max()

In [None]:
data['total_bill'].min()
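
With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each value as `(x - min) / (max - min)`, so the smallest value maps to 0 and the largest to 1. The sketch below reproduces that arithmetic by hand on a tiny made-up series (`bills` is a hypothetical name and the values are not from tips.csv):

```python
import pandas as pd

# Hypothetical bill amounts for illustration only.
bills = pd.Series([10.0, 20.0, 50.0])

# Min-max scaling by hand: subtract the minimum, divide by the range.
scaled = (bills - bills.min()) / (bills.max() - bills.min())

print(scaled.tolist())
print(scaled.max() - scaled.min())  # range of the scaled values
```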

### 21. Load the dataset again by giving the name of the dataframe as "tips_df"
- i) Encode the categorical variables.
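- ii) Store the target column (i.e. tip) in the y variable and the rest of the columns in the X variable.

In [None]:
tips_df = pd.read_csv('tips.csv')  # reload so the earlier scaling of total_bill is not carried over

In [None]:
tips_df = pd.get_dummies(tips_df,drop_first=True)  # i) encode the categorical variables

In [None]:
X = tips_df.drop('tip',axis=1)

In [None]:
y = tips_df['tip']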

### 22. Split the dataset into two parts (i.e. 70% train and 30% test), and scale the columns "total_bill" and "size" using the min-max scaling approach
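
The worksheet leaves this step without code. The following is a minimal sketch of one way to approach it, not a definitive solution. It assumes a small hypothetical stand-in frame `demo_df` with the same columns as tips.csv (the values are made up), and it fits the scaler on the training rows only so that no information from the test rows leaks into the scaling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in rows with the same columns as tips.csv (values are made up).
demo_df = pd.DataFrame({
    'total_bill': [10.0, 20.5, 15.2, 30.1, 25.0, 12.3, 18.7, 40.0, 22.2, 16.4],
    'tip': [1.5, 3.0, 2.0, 4.5, 3.5, 2.0, 2.5, 6.0, 3.2, 2.4],
    'sex': ['Male', 'Female'] * 5,
    'smoker': ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
    'day': ['Sat', 'Sun', 'Thur', 'Fri', 'Sat', 'Sun', 'Thur', 'Fri', 'Sat', 'Sun'],
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner',
             'Dinner', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'size': [2, 3, 2, 4, 3, 2, 2, 6, 3, 2],
})

# Encode the categorical columns and separate features from the target.
X = pd.get_dummies(demo_df.drop('tip', axis=1), drop_first=True)
X = X.astype({'size': 'float64'})  # scaled values are floats
y = demo_df['tip']

# 70% train / 30% test split; random_state is an arbitrary choice for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training rows only, then apply it to both parts.
num_cols = ['total_bill', 'size']
scaler = MinMaxScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.shape, X_test.shape)
```

With the real data, `demo_df` would be replaced by the reloaded `tips_df`. Test values can fall slightly outside the 0 to 1 interval because the scaler only sees the training minimum and maximum.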

### 23. Train a linear regression model using the training data and print the R-squared value of the prediction on the test data.
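
No code is given for this step either. As a hedged sketch, the example below trains scikit-learn's `LinearRegression` and scores it with `r2_score`. To stay self-contained it uses synthetic data: `demo_X`, `demo_y`, and the coefficients used to generate the tips are assumptions made for illustration, not properties of tips.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: tips generated from an assumed linear rule plus noise.
rng = np.random.default_rng(0)
n = 100
demo_X = pd.DataFrame({
    'total_bill': rng.uniform(5, 50, n),
    'size': rng.integers(1, 7, n).astype(float),
})
demo_y = 0.15 * demo_X['total_bill'] + 0.2 * demo_X['size'] + rng.normal(0, 0.3, n)

# Same 70/30 split as in the previous step.
X_train, X_test, y_train, y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42
)

# Fit on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# R-squared of the predictions on the held-out test data.
r2 = r2_score(y_test, model.predict(X_test))
print(f"R-squared on the test set: {r2:.3f}")
```

On the real dataset, the `X_train`, `X_test`, `y_train`, and `y_test` produced in step 22 would take the place of the synthetic split.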

### Happy Learning:)