# **Data Analysis Phase**
### In this notebook we focus on building machine learning pipelines that cover the full life cycle of a data science project. This is especially useful for practitioners who have not yet worked with large datasets.

### The main aim of the project is to predict house prices based on various features.


# **Import the Dependencies**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# **Import the Data**

In [None]:
dataset = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

In [None]:
dataset.head(10)
# head(10) displays the first 10 rows of the DataFrame

In [None]:
dataset.shape
# shape returns a tuple: (number of rows, number of columns) of the DataFrame

# **Data Analysis Phase**

### 1. Missing Values
### 2. All The Numerical Variables
### 3. Distribution of the Numerical Variables
### 4. Categorical Variables
### 5. Cardinality of Categorical Variables
### 6. Outliers
### 7. Relationship between the independent features and the dependent feature (SalePrice)

# **Missing Values**

In [None]:
## Here we will check the percentage of NaN values present in each feature
## Step 1: make a list of the features that have missing values
features_with_na = [feature for feature in dataset.columns if dataset[feature].isnull().sum() > 0]

for feature in features_with_na:
    # isnull().mean() is a fraction, so multiply by 100 to report a percentage
    print(feature, np.round(dataset[feature].isnull().mean() * 100, 2), '% missing values')

**Since there are many missing values, we need to find the relationship between missing values and SalePrice.
Let's plot some diagrams for this relationship.**

In [None]:
for feature in features_with_na:
    data = dataset.copy()
    
    # Let's make a variable that indicates 1 if observation was missing else 0
    data[feature] = np.where(data[feature].isnull(), 1, 0)
    
    # Let's calculate the median SalePrice where the information is missing or present
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.title(feature)
    plt.show()

**Here the relationship between the missing values and the dependent variable is clearly visible, so we need to replace these NaN values with something meaningful.
Some features in the dataset, such as "Id", are not required.**
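
One common way to replace missing values is sketched below. It is an assumption for illustration, not a step taken from this notebook: numerical gaps are filled with the column median, categorical gaps with an explicit label, and a 0/1 indicator column keeps the "was missing" signal seen in the plots above. The toy frame, the `impute_with_indicator` helper, the `_nan` suffix, and the `'Missing'` label are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real training set (values are made up)
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0, 60.0],
    'Alley': ['Grvl', np.nan, np.nan, 'Pave'],
})

def impute_with_indicator(df):
    """Hypothetical helper: median/label imputation plus a missing-value flag."""
    out = df.copy()
    for col in df.columns:
        if not df[col].isnull().any():
            continue  # nothing to impute in this column
        # Record where the value was missing before overwriting it
        out[col + '_nan'] = df[col].isnull().astype(int)
        if pd.api.types.is_numeric_dtype(df[col]):
            out[col] = df[col].fillna(df[col].median())
        else:
            out[col] = df[col].fillna('Missing')
    return out

imputed = impute_with_indicator(toy)
print(imputed)
```

The median is used here instead of the mean because it is less sensitive to extreme values; whether that is the right choice depends on the feature.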

In [None]:
print("Id of Houses {}".format(len(dataset['Id'])))

# **Numerical Variables**

In [None]:
# list of numerical variables
numerical_features = [features for features in dataset.columns if dataset[features].dtypes != 'O']

print("Number of Numerical Features is {}".format(len(numerical_features)))

# visualise the numerical variables
dataset[numerical_features].head()

## **Temporal Variables (e.g. Datetime Variables)**
### The dataset has 4 year variables. From datetime variables we have to extract information such as the number of years or the number of days. One example in this scenario is the difference in years between the year the house was built and the year it was sold. We will perform this analysis in the Feature Engineering phase.

In [None]:
# list of variables that  contain year information 
year_feature = [feature for feature in numerical_features if 'Yr' in feature or 'Year' in feature]

year_feature

In [None]:
# Let's explore the content of these year variables
for feature in year_feature:
    print(feature, dataset[feature].unique())

In [None]:
## Let's analyze the temporal datetime variables
## We will check whether there is a relation between the year the house was sold and the sale price

dataset.groupby('YrSold')['SalePrice'].median().plot()
plt.xlabel('Year Sold')
plt.ylabel('Median House Price')
plt.title("House Price vs YearSold")

In [None]:
## Here we will compare the difference between each year feature and YrSold against SalePrice

for feature in year_feature:
    if feature != 'YrSold':
        data = dataset.copy()
        
        # We will capture the difference between year variable and year the house was sold for
        data[feature] = data['YrSold'] - data[feature]
        plt.scatter(data[feature], data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.show()
        

In [None]:
print(dataset['YearBuilt'].head())

In [None]:
print(dataset['YrSold'].head())

In [None]:
print(dataset['SalePrice'].head())

In [None]:
dataset['YearBuilt'].sort_values().unique()

## 2 Types of Numerical Variables:
### 1. Continuous
### 2. Discrete

In [None]:
discrete_features = [feature for feature in numerical_features if len(dataset[feature].unique()) < 25 and feature not in year_feature+['Id']]

print("Discrete Variables {}".format(len(discrete_features)))

In [None]:
discrete_features

In [None]:
## Let's find the relationship between them and Sale Price

for feature in discrete_features:
    data = dataset.copy()
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()

In [None]:
continous_features = [feature for feature in numerical_features if feature not in discrete_features and feature not in year_feature + ['Id']]
print("Continuous Features {}".format(len(continous_features)))

In [None]:
dataset['ScreenPorch'].unique()

In [None]:
for feature in continous_features:
    data = dataset.copy()
    data[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()

## Exploratory Data Analysis Part 2
### We will be using a logarithmic transformation

In [None]:
for feature in continous_features:
    data = dataset.copy()
    if 0 in data[feature].unique():   # skip features that contain 0, because log(0) is undefined
        pass
    else:
        data[feature] = np.log(data[feature])
        data['SalePrice'] = np.log(data['SalePrice'])
        plt.scatter(data[feature], data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.title(feature)
        plt.show()
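
To see why a log transform is useful, one option is to compare skewness before and after applying it. This is a minimal sketch on synthetic log-normal data, not on the competition data; the seed and distribution parameters are arbitrary assumptions, and `Series.skew()` is the standard pandas method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal), made up for illustration
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5000)))

skew_raw = prices.skew()          # clearly positive for right-skewed data
skew_log = np.log(prices).skew()  # close to 0 once the long tail is compressed

print("skewness before log: {:.2f}".format(skew_raw))
print("skewness after log:  {:.2f}".format(skew_log))
```

A skewness near zero after the transform suggests the feature is closer to symmetric, which many regression models handle better.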

### Outliers: values in a distribution that are much higher or much lower than the rest

In [None]:
for feature in continous_features:
    data = dataset.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature] = np.log(data[feature])
        data.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()
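
The boxplots above draw their whiskers at 1.5 times the interquartile range (the matplotlib default), and points beyond the whiskers are shown as outliers. The same rule can be applied numerically to count or list them. This is a sketch under assumptions: `iqr_outlier_mask` is a hypothetical helper and the `area` values are made up.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Hypothetical helper: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up living-area values with one obviously extreme entry
area = pd.Series([900, 1000, 1100, 1200, 1300, 1400, 1500, 9000])

# Apply the rule on the log scale, matching the boxplots in this section
mask = iqr_outlier_mask(np.log(area))
print(area[mask].tolist())
```

Flagging a value this way only marks it as unusual; whether to cap, transform, or keep it is a separate modelling decision.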

# **Categorical Features**

In [None]:
categorical_features = [feature for feature in dataset.columns if dataset[feature].dtypes == 'O']
print("Categorical features {}".format(len(categorical_features)))

In [None]:
dataset[categorical_features].head()

In [None]:
for feature in categorical_features:
    print("The feature is {} and number of categories are {}".format(feature, len(dataset[feature].unique())))

**Relationship between the categorical variables and the dependent feature ('SalePrice')**

In [None]:
for feature in categorical_features:
    data = dataset.copy()
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()

### If you find this notebook helpful, kindly toss an upvote🧡