
Agenda

1. Data Quality Check

In [None]:
## Data Manipulation Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Preprocessing Libraries
from sklearn.preprocessing import StandardScaler


## Dataset validation Libraries
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import VarianceThreshold

In [None]:
data = pd.read_csv('/content/abp_accel.csv')
data.head()

Unnamed: 0,timestamp,x,y,z
0,2015-06-12 13:30:00.161041,100,620,804
1,2015-06-12 13:30:00.260490,68,640,800
2,2015-06-12 13:30:00.359939,48,628,884
3,2015-06-12 13:30:00.459388,44,616,888
4,2015-06-12 13:30:00.558837,76,628,860


**Conclusions:**
Timestamp : yyyy-mm-dd hh:MM:ss.microseconds

Acceleration_Data : The acceleration values are in milli-g (mg), i.e. thousandths of standard gravity.
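
The two conventions above can be sketched on a tiny hand-made frame that copies the first three rows of the output shown earlier (not the full file). The sketch assumes the readings are thousandths of standard gravity; the helper name `mg_to_ms2` and the constant `STANDARD_GRAVITY` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

STANDARD_GRAVITY = 9.80665  # m/s^2 per 1 g (standard value)


def mg_to_ms2(values_mg):
    """Convert milli-g readings to m/s^2 (hypothetical helper)."""
    return values_mg / 1000.0 * STANDARD_GRAVITY


# First three rows of the head() output, typed in by hand
sample = pd.DataFrame({
    "timestamp": [
        "2015-06-12 13:30:00.161041",
        "2015-06-12 13:30:00.260490",
        "2015-06-12 13:30:00.359939",
    ],
    "x": [100, 68, 48],
    "y": [620, 640, 628],
    "z": [804, 800, 884],
})

# Parse the yyyy-mm-dd hh:MM:ss.microseconds strings into datetimes
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%Y-%m-%d %H:%M:%S.%f")

# Convert each axis from milli-g to m/s^2
sample_ms2 = sample[["x", "y", "z"]].apply(mg_to_ms2)

# Time between consecutive samples
sampling_interval = sample["timestamp"].diff().dropna()

print(sample_ms2)
print(sampling_interval)
```

On these three rows the gap between consecutive timestamps is just under 0.1 s, which hints at a sampling rate of roughly 10 Hz; that is only an observation about the sample, not a statement about the whole recording.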

In [None]:
data.shape

(1441669, 4)

In [None]:
14595853

## Duplicate Validation

In [None]:
# Select duplicated rows (all occurrences except the first), comparing all columns
duplicate = data[data.duplicated()]
duplicate

Unnamed: 0,timestamp,x,y,z


**Conclusions:**

No duplicate rows were found.
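
To make the behaviour of `duplicated()` concrete, here is a minimal sketch on a made-up four-row frame (the values are illustrative, not taken from the dataset). It shows how the default `keep='first'` differs from `keep=False`.

```python
import pandas as pd

# Toy frame: rows 1 and 2 are identical on every column
toy = pd.DataFrame({
    "timestamp": ["t0", "t1", "t1", "t2"],
    "x": [100, 68, 68, 48],
    "y": [620, 640, 640, 628],
    "z": [804, 800, 800, 884],
})

# keep='first' (the default) flags only the repeats, not the first occurrence
repeats = toy[toy.duplicated()]

# keep=False flags every member of a duplicated group
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each group
deduplicated = toy.drop_duplicates()

print(len(repeats), len(all_copies), len(deduplicated))
```

An empty result from `data[data.duplicated()]`, as above, therefore means no row repeats an earlier row on all columns.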

**NULL values validation**

In [None]:
data.isnull().sum()

timestamp    0
x            0
y            0
z            1
dtype: int64

**Conclusions:**
Since only one record contains a null value (in column z), we can drop that record.

In [None]:
data = data.dropna()

In [None]:
data.isnull().sum()

timestamp    0
x            0
y            0
z            0
dtype: int64

## Descriptive Statistics

In [None]:
data.describe()

Unnamed: 0,x,y,z
count,13291440.0,13291440.0,13291440.0
mean,46.53656,890.2614,162.7622
std,346.4005,205.6476,339.3956
min,-1840.0,-1828.0,-2040.0
25%,-240.0,828.0,-8.0
50%,168.0,968.0,120.0
75%,320.0,1016.0,320.0
max,1684.0,1904.0,2040.0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13291436 entries, 0 to 13291435
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   timestamp  object 
 1   x          int64  
 2   y          int64  
 3   z          float64
dtypes: float64(1), int64(2), object(1)
memory usage: 507.0+ MB


## Conclusions: Descriptive Statistics
The descriptive statistics show a large gap between the mean and the extreme values (min and max) of each axis. This suggests that the data contains outliers.
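
One way to make this observation more concrete is Tukey's 1.5 × IQR rule, which is the rule box plots conventionally use for their whiskers. It is a swapped-in check, not something the notebook itself computes. The sketch below takes the quartiles and extremes straight from the `describe()` output above; the function name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# 25% and 75% rows of the describe() output
quartiles = {"x": (-240.0, 320.0), "y": (828.0, 1016.0), "z": (-8.0, 320.0)}
# min and max rows of the describe() output
extremes = {"x": (-1840.0, 1684.0), "y": (-1828.0, 1904.0), "z": (-2040.0, 2040.0)}

for axis in ("x", "y", "z"):
    lower, upper = iqr_fences(*quartiles[axis])
    lowest, highest = extremes[axis]
    print(
        f"{axis}: fences=({lower}, {upper}), "
        f"min below fence={lowest < lower}, max above fence={highest > upper}"
    )
```

For all three axes both the minimum and the maximum fall outside the fences, so each axis has at least one point that a box plot would draw as an outlier.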

## **Data Visualizations : Univariate Analysis**

In [None]:
# Plot only the numeric axes; the timestamp column is still a string at this point
for col in ['x','y','z']:
  plt.figure(figsize=(8,4))
  ax = sns.histplot(data[col],kde=True,bins=20)
  plt.xticks(rotation=45)
  plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(x='x',data=data)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(x='y',data=data)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(x='z',data=data)
plt.show()

In [None]:
data.shape

In [None]:
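# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
data_df = data[data['z'] >= -1000].copy()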
data_df.shape

**Conclusions**

The box plot of the z feature shows outliers below the lower whisker. Rows with z < -1000 were removed (4 rows).

In [None]:
sns.barplot(x='x',y='y',hue='z',data = data_df)
plt.show()

**Convert the Object to Datetime**

In [None]:
data_df['timestamp'] = pd.to_datetime(data_df['timestamp'],format='%Y-%m-%d %H:%M:%S.%f')


In [None]:
data_df.head()

In [None]:
data_df.head(13)

In [None]:
data_df.info()

In [None]:
# sns.pairplot(data_df)
# plt.show()

**Conclusions:**
Since the data points are densely distributed, as seen above, a density-based algorithm such as DBSCAN is a reasonable candidate for clustering.
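
As a rough sketch of what that could look like, the example below runs scikit-learn's `DBSCAN` on synthetic (x, y, z) points: two dense blobs plus one far-away stray point. The data, `eps=0.3` and `min_samples=5` are all assumptions chosen for the illustration; on the real recording these parameters would need tuning, and with millions of rows a subsample may be necessary for the run to stay tractable.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic dense blobs of (x, y, z) readings and one isolated point
blob_a = rng.normal(loc=[0, 1000, 0], scale=10, size=(50, 3))
blob_b = rng.normal(loc=[300, 900, 300], scale=10, size=(50, 3))
stray = np.array([[2000.0, -2000.0, 2000.0]])
points = np.vstack([blob_a, blob_b, stray])

# Standardize so that eps means the same thing on every axis
scaled = StandardScaler().fit_transform(points)

# DBSCAN labels dense regions 0, 1, ... and marks noise points as -1
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(scaled)

n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

DBSCAN does not need the number of clusters up front and it labels isolated points as noise, which is why it fits dense sensor data with occasional outliers.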

In [None]:
for col in data_df.columns:
  plt.figure(figsize=(8,4))
  ax = sns.histplot(data_df[col],kde=True,bins=20)
  plt.xticks(rotation=45)
  plt.show()

## Multicollinearity Check

In [None]:
data_df1 = data_df.drop(['timestamp'],axis=1)

In [None]:
vif_data = pd.DataFrame()
vif_data['feature'] = data_df1.columns


# Calculating the VIF for each feature
vif_data['VIF'] = [variance_inflation_factor(data_df1.values,i) for i in range(len(data_df1.columns))]

vif_data = vif_data.sort_values('VIF',ascending=False)

vif_data


In [None]:
plt.figure(figsize=(10,7))
ax = sns.barplot(y='feature',x ='VIF',data = vif_data)
plt.show()

## Conclusions : Multicollinearity Validation

Since every VIF score is below 10, there is no strong multicollinearity among the features.

# Importance of each feature

In [None]:
# Pearson correlation between the numeric axes (data_df1 excludes the timestamp column)
corr = data_df1.corr()
plt.figure(figsize=(6,6))
sns.heatmap(corr,annot=True)
plt.show()

## Conclusions : Pearson Correlation Matrix

The matrix shows a strong negative correlation between a pair of features: when one feature increases, the other tends to decrease.
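
For reference, the Pearson coefficient behind the heatmap can be computed by hand. The sketch below uses two made-up columns `u` and `v` (not the accelerometer data) that move in opposite directions, and checks the manual value against pandas; the function name `pearson_r` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up columns: v falls as u rises
toy = pd.DataFrame({
    "u": [1.0, 2.0, 3.0, 4.0, 5.0],
    "v": [10.0, 8.0, 6.5, 4.0, 2.0],
})


def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )


r_manual = pearson_r(toy["u"], toy["v"])
r_pandas = toy["u"].corr(toy["v"])  # pandas uses Pearson by default
print(r_manual, r_pandas)
```

A value close to -1, as in this toy case, is what "strongly negatively correlated" means; values near 0 indicate no linear relationship.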

## Feature Selection - Dropping Constant Features.
VarianceThreshold with threshold = 0 identifies features whose variance is zero, i.e. constant features.

In [None]:
var_thres = VarianceThreshold(threshold = 0)
var_thres.fit(data_df1)

In [None]:
var_thres.get_support()

In [None]:
data_df1.columns[var_thres.get_support()]

In [None]:
constant_columns = [column for column in data_df1.columns if column not in data_df1.columns[var_thres.get_support()]]


print(constant_columns)

## Conclusions : Constant Features
The dataset contains no constant features; every feature has non-zero variance.
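
For contrast, here is a minimal sketch of what the same check reports when a constant column is present. The frame is made up: `x` and `y` reuse a few values from the head of the dataset, and `flat` is a hypothetical column added only to trigger the filter.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "x": [100, 68, 48, 44],
    "y": [620, 640, 628, 616],
    "flat": [7, 7, 7, 7],  # hypothetical constant column
})

# threshold=0 keeps only features with non-zero variance
selector = VarianceThreshold(threshold=0)
selector.fit(toy)

mask = selector.get_support()          # True for the columns that are kept
kept = list(toy.columns[mask])
constant = list(toy.columns[~mask])

print(kept, constant)
```

An empty `constant` list, as in the notebook's run, means no column was filtered out.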