# Gym Crowdedness Analysis with PCA

> # Objective : 

To **predict** how crowded a university gym will be at a given time of day, using that time together with other features such as the weather.

> # Data Description : 

The dataset consists of 26,000 people counts (about every 10 minutes) over one year. The dataset also contains information about the weather and semester-specific information that might affect how crowded it is. The label is the number of people, which has to be predicted given some subset of the features.

**Label**:

- Number of people

**Features**:

     1. date (string; datetime of data)
     2. timestamp (int; number of seconds since beginning of day)
     3. day_of_week (int; 0 [monday] - 6 [sunday])
     4. is_weekend (int; 0 or 1) [boolean, if 1, it's either saturday or sunday, otherwise 0]
     5. is_holiday (int; 0 or 1) [boolean, if 1 it's a federal holiday, 0 otherwise]
     6. temperature (float; degrees fahrenheit)
     7. is_start_of_semester (int; 0 or 1) [boolean, if 1 it's the beginning of a school semester, 0 otherwise]
     8. is_during_semester (int; 0 or 1) [boolean, if 1 it's during a school semester, 0 otherwise]
     9. month (int; 1 [jan] - 12 [dec])
     10. hour (int; 0 - 23)

> # Approach

The model will be built, and PCA applied, in the following steps:

- **Data Cleaning and PreProcessing**
- **Exploratory Data Analysis :**
  
      - Uni-Variate Analysis : Histograms , Distribution Plots
      - Bi-Variate Analysis : Pair Plots
      - Correlation Matrix
      
- **Processing :**
      
      - OneHotEncoding 
      - Feature Scaling : Standard Scaler

- **Splitting Dataset** 
- **Principal Component Analysis**
- **Modelling : Random Forest**

      - Random forest without PCA
      - Random Forest with PCA
      
- **Conclusion**

## `1` Data Cleaning and PreProcessing 

**Importing Libraries and loading Dataset**

In [1]:
import numpy as np # linear algebra
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df=pd.read_csv(r'C:\Users\kusht\OneDrive\Desktop\Excel-csv\PCA analysis.csv') #Replace it with your path where the data file is stored
df.head()

| Unnamed: 0 | number_people | date | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 2015-08-14 17:00:11-07:00 | 61211 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 1 | 45 | 2015-08-14 17:20:14-07:00 | 62414 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 2 | 40 | 2015-08-14 17:30:15-07:00 | 63015 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 3 | 44 | 2015-08-14 17:40:16-07:00 | 63616 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 4 | 45 | 2015-08-14 17:50:17-07:00 | 64217 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |


**TASK : Print the `info()` of the dataset**

In [3]:
### START CODE HERE (~ 1 Line of code)

### END CODE

**TASK : Describe the dataset using `describe()`**

In [4]:
### START CODE HERE (~ 1 Line of code)

### END CODE

**TASK : Convert temperature from Fahrenheit to Celsius using the formula `Celsius = (Fahrenheit - 32) * (5/9)`**

In [5]:
### START CODE HERE (~1 Line of code)

### END CODE
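
As a minimal sketch of the conversion formula, assuming a DataFrame with a `temperature` column in Fahrenheit (the three-row frame `toy` and the column name `temperature_c` are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Toy frame standing in for the gym data; the values are illustrative only.
toy = pd.DataFrame({"temperature": [32.0, 71.76, 212.0]})

# Vectorised Fahrenheit -> Celsius conversion applied to the whole column.
toy["temperature_c"] = (toy["temperature"] - 32) * (5 / 9)
print(toy)
```

The same expression can overwrite `temperature` in place if a separate column is not wanted.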

**TASK : Convert `timestamp`, which is currently in seconds, into hours in 12-hour format, and drop the `date` column**

In [6]:
### START CODE HERE: (~ 1 Line of code)

### END CODE
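
One possible reading of this task, sketched on a two-row toy frame (the frame `toy` and the helper name `hour_24` are assumptions for illustration): integer-divide the seconds by 3600 to get the hour of day, then fold it onto a 12-hour clock.

```python
import pandas as pd

# Toy rows: 61211 s is 17:00:11 and 605 s is 00:10:05.
toy = pd.DataFrame({
    "date": ["2015-08-14 17:00:11-07:00", "2015-08-14 00:10:05-07:00"],
    "timestamp": [61211, 605],
})

# Seconds since midnight -> hour of day (0-23).
hour_24 = toy["timestamp"] // 3600

# Fold onto a 12-hour clock: 0 and 12 map to 12, 13 maps to 1, and so on.
toy["timestamp"] = ((hour_24 + 11) % 12) + 1

# The raw datetime string is no longer needed.
toy = toy.drop(columns=["date"])
print(toy)
```

A 12-hour value on its own cannot distinguish morning from evening; in this dataset the separate `hour` column still carries the 24-hour value.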

## `2` Exploratory Data Analysis

### `2.1` Uni-Variate and Bi-Variate Analysis

- **Pair Plots**

**TASK : Use `pairplot()` to make different pair scatter plots of the entire dataframe**

In [7]:
### START CODE HERE :

### END CODE

**TASK: Now analyse scatter plots of `number_people` against every other attribute, using a `for` loop, to find the conditions under which most people come to the gym** 

In [8]:
### START CODE HERE 
    
### END CODE

**Analyse the plots and understand :**
1. **At what time, temperature and day of the week do more people come in?**
        
2. **Do people prefer to come to the gym on holidays and weekends, or on working days?**
       
3. **In which month do most people come to the gym?** 

- **Distribution Plots**

**TASK : Plot an individual `distplot()` for `temperature` and for `number_people` to check the distribution of each attribute (`distplot()` is deprecated in recent seaborn releases, where `histplot()` is the suggested replacement)** 

In [9]:
### START CODE HERE : 

### END CODE

### `2.2` Correlation Matrix

**TASK : Plot a correlation matrix and make it more understandable using `sns.heatmap`**

In [10]:
### START CODE HERE : 


### END CODE HERE 

**Analyse the correlation matrix and understand how the attributes depend on each other** 
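
To read such a matrix programmatically rather than by eye, one option is to rank the features by the absolute value of their correlation with the label. The sketch below uses synthetic data (the frame `toy` and its made-up relationship between `hour` and `number_people` are assumptions, not properties of the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
hour = rng.integers(0, 24, size=n)
toy = pd.DataFrame({
    "hour": hour,
    "temperature": rng.normal(60, 10, size=n),
    # Synthetic label that depends on the hour plus noise.
    "number_people": 2 * hour + rng.normal(0, 5, size=n),
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Features ordered by how strongly they correlate with the label.
ranking = (
    corr["number_people"]
    .drop("number_people")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

The `corr` frame is also what would be passed to `sns.heatmap` for plotting.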

## `3.` Processing : 

### `3.1` One hot encoding :
One-hot encode certain attributes so that no ranking or priority is implied between their values.

**TASK: One-hot encode the following attributes: `month`, `hour`, `day_of_week`**

In [11]:
## YOU CAN USE EITHER get_dummies() OR OneHotEncoder()

### START CODE HERE : 

### END CODE 
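
A minimal sketch of the `get_dummies()` route on a toy frame (the frame `toy` and the result name `encoded` are illustrative assumptions): each listed column is replaced by one indicator column per distinct value, while the other columns pass through unchanged.

```python
import pandas as pd

# Toy frame with two categorical-like integer columns and one numeric column.
toy = pd.DataFrame({
    "month": [1, 8, 12],
    "day_of_week": [0, 4, 6],
    "temperature": [55.0, 71.76, 48.2],
})

# Replace the listed columns with indicator columns named <column>_<value>.
encoded = pd.get_dummies(toy, columns=["month", "day_of_week"])
print(encoded.columns.tolist())
```

Only values that actually occur in the data get a column, so the toy frame yields three `month_*` columns rather than twelve.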

### `3.2` Feature Scaling :
Some attributes have ranges very different from the others, which can cause problems during PCA because PCA is driven by variance, so you need to standardise those attributes.

**TASK: Using `StandardScaler()` , standardise `temperature` and `timestamp`**

In [55]:
## You can use two individual scalers, one for temperature and the other for timestamp
## You can also work on an array (data = df.values), standardise it, then split the data into X and y
from sklearn.preprocessing import StandardScaler
### START CODE HERE : (Replace places having '#' with the code)
data=df.values
scaler1 = StandardScaler()
scaler1.fit(#) # for timestamp
data[#] = scaler1.transform(#)

scaler2 = StandardScaler()
scaler2.fit(data[#]) # for temperature
data[#] = scaler2.transform(data[#])

### END CODE HERE
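
As a rough illustration of what `StandardScaler` does, the sketch below scales a small made-up array (the array `toy` and its values are assumptions; the first column plays the role of `timestamp`, the second of `temperature`). A single scaler is used here for both columns, which is an alternative to the two-scaler template above since each column is standardised independently either way.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (seconds vs. degrees).
toy = np.array([
    [61211.0, 71.76],
    [30000.0, 55.00],
    [80000.0, 62.30],
    [45000.0, 48.90],
])

# Fit learns each column's mean and standard deviation; transform applies them.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

In a full pipeline it is common to fit the scaler on the training split only and reuse it on the test split.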

## `4.` Splitting the dataset : 

**TASK : Split the dataset into dependent and independent variables and name them y and X respectively** 

In [12]:
### START CODE HERE : 

### END CODE

**TASK : Split X and y into training and test sets**

In [13]:
from sklearn.model_selection import train_test_split
### START CODE HERE : 

### END CODE
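
A small sketch of `train_test_split` on made-up arrays (the names `X_toy` and `y_toy`, the 25% test share and the seed are assumptions chosen for illustration): features and labels are shuffled together, so each row keeps its own label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy samples with 2 features each; the label is simply the row number.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Hold out 25% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```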

## `5.` Principal Component Analysis 

Principal component analysis (PCA) is a technique for reducing the dimensionality of datasets with many features, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.

**How does it work? :**

- First, a covariance matrix is calculated, which summarizes how our variables all relate to one another.

- Second, the matrix is broken down into two separate components: direction and magnitude. This makes it easy to understand the “directions” of the data and their “magnitude” (how “important” each direction is). The image below displays the two main directions in this data: the “red direction” and the “green direction”. Here the “red direction” is the more important one: given how the dots are arranged, most of the spread of the data lies along it. (Hint: what would a line of best fit to this data look like?)


<img src="https://miro.medium.com/max/832/1*P8_C9uk3ewpRDtevf9wVxg.png">

- Then the data is transformed to align with these important directions (which are combinations of our original variables). The photo below is the same exact data as above, but transformed so that the x- and y-axes are now the “red direction” and “green direction.”  What would the line of best fit look like here?

<img src="https://miro.medium.com/max/1400/1*V3JWBvxB92Uo116Bpxa3Tw.png">

So PCA finds the directions along which the data is most spread out and keeps only those components, thereby reducing the number of attributes to train on and speeding up computation. A 3D example is given below : 

<img src="https://miro.medium.com/max/1024/1*vfLvJF8wHaQjDaWv6Mab2w.png">

As you can see above, a 3D plot is reduced to a 2D plot while still retaining most of the information.
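
The three steps described above can be sketched with NumPy alone. This is an illustration under stated assumptions, not how scikit-learn implements `PCA` internally: the two-feature synthetic data and all variable names are made up, and the "directions" and "magnitudes" are taken to be the eigenvectors and eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated features: the second is roughly twice the first plus noise.
x = rng.normal(size=300)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=300)])

# Step 1: centre the data and summarise how the variables relate.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Step 2: split the matrix into directions (eigenvectors)
# and magnitudes (eigenvalues), largest variance first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 3: rotate the data so the new axes are those directions.
projected = centered @ eigenvectors

# Share of the total variance carried by each direction.
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)
```

After the rotation the new variables are uncorrelated, and dropping the last column of `projected` would be the dimensionality reduction step.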

**Now that you have understood this, let's try to implement it** 

**TASK : Print the PCA `fit_transform` of X (the independent variables)**

In [14]:
from sklearn.decomposition import PCA

### START CODE HERE : (Replace spaces having '#' with the code)
pca = PCA()
pca.fit_transform(#)

### END CODE

**TASK : Get covariance using `get_covariance()`**

In [15]:
### START CODE HERE (~ 1 line of code) 

### END CODE HERE

**TASK : Get the explained variance using `explained_variance_ratio_`**

In [16]:
### START CODE HERE : 

### END CODE

**TASK : Plot a bar graph of `explained variance`**

In [None]:
# you can use plt.bar()

### START CODE HERE : (Replace spaces having '#' with the code)
with plt.style.context('dark_background'):
    plt.figure(figsize=(15,12))

    plt.bar(range(49), '#', alpha=0.5, align='center',
            label='individual explained variance')
    plt.ylabel('#')
    plt.xlabel('#')
    plt.legend(loc='best')
    plt.tight_layout()

### END CODE

**Analyse the plot and estimate how many components you want to keep**
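
Besides reading the bar plot, one common heuristic is to keep the smallest number of components whose cumulative explained variance passes a chosen threshold. The sketch below applies it to synthetic data; the 95% threshold, the data-generating setup and the names `X_toy` and `n_keep` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 features driven by 3 hidden factors plus a little noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + rng.normal(scale=0.05, size=(500, 10))

# Fit PCA with all components and accumulate the variance ratios.
pca = PCA().fit(X_toy)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # assumed cut-off, adjust to taste
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative[:n_keep])
```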

**TASK : Make a `PCA()` object with `n_components=20`, fit-transform it on the dataset (X), and assign the result to a new variable `X_new`**

In [17]:
### START CODE HERE : 

### END CODE

Now `X_new` is the PCA-transformed dataset.

**TASK : Get covariance using `get_covariance()`**

In [18]:
### START CODE HERE (~1 Line of code)

### END CODE

**TASK : Get the explained variance using `explained_variance_ratio_`**

In [19]:
### START CODE HERE :


### END CODE

**TASK : Plot a bar plot of `explained variance`**

In [20]:
# You can use plt.bar()

### START CODE HERE: 
    
### END CODE

## `6.` Modelling : Random Forest

To understand random forests, let's first get a brief idea of decision trees in general. Decision trees are very intuitive, and everyone has used one, knowingly or unknowingly, at some point. The model keeps sorting samples into groups by asking a sequence of questions (decisions), forming a large tree, which is why it is called a decision tree. An image example helps to understand it better :

<img src="https://camo.githubusercontent.com/960e89743476577bd696b3ac16885cf1e1d19ad1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f313030302f312a4c4d6f4a6d584373516c6369475445796f534e3339672e6a706567">

`Random Forest` : A random forest, as its name implies, consists of a large number of individual decision trees that operate as an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning). For classification, each individual tree outputs a class prediction and the class with the most votes becomes the model's prediction; for regression, as with the `RandomForestRegressor` used below, the trees' numeric predictions are averaged.

<img src="https://camo.githubusercontent.com/30aec690ddc10fa0ae5d3135d0c7a6b745eb5918/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f313030302f312a56484474566144504e657052676c49417637324246672e6a706567">

The fundamental concept is that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. Since this dataset has very low correlation between attributes, a random forest can be a good option.
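
The committee idea can be checked directly for the regression case, where scikit-learn's `RandomForestRegressor` combines its trees by averaging their numeric predictions. The sketch below uses synthetic data; the arrays `X_toy` and `y_toy`, the sine-shaped relationship and the choice of 20 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# One made-up feature (think: hour of day) and a noisy, curved target.
X_toy = rng.uniform(0, 24, size=(200, 1))
y_toy = 50 * np.sin(X_toy[:, 0] / 24 * np.pi) + rng.normal(0, 3, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Each fitted tree makes its own prediction for the first five samples ...
per_tree = np.stack([tree.predict(X_toy[:5]) for tree in forest.estimators_])

# ... and the forest's prediction is compared with the average of those.
forest_pred = forest.predict(X_toy[:5])
print(np.allclose(per_tree.mean(axis=0), forest_pred))
```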

In this section you will build a random forest model and train it on both the dataset without PCA and the dataset with PCA, to analyse the differences.

### `6.1` Random Forest Without PCA


**TASK : Make a random forest model and train it on the training set without PCA**

In [71]:
# Establish model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

In [None]:
# Try different numbers of n_estimators and print the scores
# You can use a variable estimators = np.arange(10,200,10) and then a for loop to take all the values of estimators

### START CODE HERE : (Replace spaces having '#' with code)
estimators = np.arange(10, 200, 10)
scores = []
for n in estimators:
    model.set_params(n_estimators='#')
    model.fit('#', '#')
    scores.append(model.score(X_test, y_test))
print(scores)    

### END CODE HERE
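
For reference, here is the same sweep pattern run end to end on synthetic data. Everything in it is an assumption made for illustration (the arrays, the shorter range of estimators, and the names `sweep_model`, `X_tr`, `X_te`, `y_tr`, `y_te`, `best_n`); for a regressor, `score` returns the R² on the data passed to it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic regression problem with two features.
X_toy = rng.uniform(0, 24, size=(300, 2))
y_toy = (40 * np.sin(X_toy[:, 0] / 24 * np.pi) + X_toy[:, 1]
         + rng.normal(0, 2, size=300))
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

sweep_model = RandomForestRegressor(random_state=0)
estimators = np.arange(10, 60, 10)
scores = []
for n in estimators:
    # Reuse one model object and only change the number of trees.
    sweep_model.set_params(n_estimators=int(n))
    sweep_model.fit(X_tr, y_tr)
    scores.append(sweep_model.score(X_te, y_te))  # R^2 on held-out data

# Number of trees with the highest held-out score.
best_n = int(estimators[int(np.argmax(scores))])
print(best_n, max(scores))
```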

**TASK : Plot `scores` against `n_estimators` to find the best number of estimators**

In [21]:
## Use plt.plot

### START CODE HERE : 

### END CODE HERE

### `6.2` Random Forest With PCA

**TASK : Split your PCA dataset into training and testing sets using `train_test_split`** 

In [22]:
from sklearn.model_selection import train_test_split
### START CODE HERE  :

### END CODE

**TASK : Make a random forest model called `model_pca`, fit it on the new X_train and y_train, and then print the random forest scores for the dataset with PCA applied**

In [75]:
# Establish model
from sklearn.ensemble import RandomForestRegressor
model_pca = RandomForestRegressor()

In [23]:
# You can use different numbers of estimators
# You can use a variable estimators = np.arange(10,200,10) and then a for loop to take all the values of estimators

### START CODE HERE : 

### END CODE

**TASK : Plot `score` against `n_estimators` and find the best parameter** 

In [24]:
# you can use plt.plot
### START CODE HERE : 


### END CODE

This completes the modelling; now it's time to analyse your models.

## `7.` Conclusion

Analyse the plots and find the best `n_estimators`. You can also tune other hyperparameters using `GridSearchCV` or `RandomizedSearchCV`. Also work out whether using PCA was beneficial; if it was not, try to explain why. 
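
A minimal sketch of a grid search over two random forest hyperparameters, run on synthetic data (the arrays `X_toy` and `y_toy`, the parameter grid and the 3-fold setting are assumptions chosen to keep the example fast, not recommended values):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic regression problem; only the first feature matters.
X_toy = rng.uniform(0, 24, size=(200, 2))
y_toy = 40 * np.sin(X_toy[:, 0] / 24 * np.pi) + rng.normal(0, 2, size=200)

# Every combination in the grid is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}  # assumed grid
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` follows the same pattern but samples a fixed number of combinations instead of trying them all.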