# Nike Shoes Analysis
### Data: Scraped products, characteristics, descriptions, ratings, and reviews from the shoes section of the Nike website, for both men's and women's products. The dataset covers over 1,200 products and more than 30,000 reviews.
### Objective: To understand Nike's product offering and its composition in terms of quantity, price, and type of product, as well as its perceived strengths and weaknesses and how it delivers value to its customers.

## Install Required Packages

- Open **Terminal/Anaconda Prompt**, cd to the project and run the following command:
 - `pip install -r requirements.txt`

- After installing all the required packages, run the following command:
 - `python -m textblob.download_corpora`
 
- Restart this Jupyter notebook.

## Importing Data and Cleaning

In [133]:
# Import libraries to be used

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')
import re

# Load the three CSV files and concatenate them into one DataFrame

# on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False:
# malformed rows are skipped and a warning is emitted for each one
df_men = pd.read_csv('./nike_shoes_men.csv', header=0, on_bad_lines='warn', quotechar='"')
df_woman0 = pd.read_csv('./nike_shoes_woman0.csv', header=0, on_bad_lines='warn', quotechar='"')
df_woman1 = pd.read_csv('./nike_shoes_woman1.csv', header=0, on_bad_lines='warn', quotechar='"')

df = pd.concat([df_men, df_woman0, df_woman1])


In [137]:
df[df['size'].str.contains('[A-Za-z]', na=False)]

Unnamed: 0,id_,gender,title,url,category,price,description,description_long,n_reviews,score,size,comfort,durability,r_title,r_raiting,r_body,r_date


In [138]:
# Replace slider values that contain letters (i.e. are not valid numbers) with missing values
df = df.replace('[A-Za-z]', value = {"size" : None, "comfort" : None, "durability" : None}, regex=True)
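An alternative way to express the same cleanup, shown as a minimal sketch on toy data: `pd.to_numeric` with `errors='coerce'` turns any non-numeric slider value into `NaN` and converts the column to a numeric dtype in one step. The toy frame and its values are illustrative assumptions, not rows from the scraped dataset.

```python
import pandas as pd

# Toy frame standing in for the scraped data (values are made up for illustration)
toy = pd.DataFrame({
    "size": ["48.5", "Runs small", "52"],
    "comfort": ["86", "90", "n/a"],
    "durability": ["76", "Durable", "83.5"],
})

slider_cols = ["size", "comfort", "durability"]

# Coerce each slider column to a number; anything that cannot be parsed becomes NaN
toy[slider_cols] = toy[slider_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy)
```

Whether this is preferable to the regex replacement depends on the data: coercion also discards values that are malformed for reasons other than containing letters.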

In [160]:
# Convert selected columns to more appropriate data types

# Recent pandas versions require an explicit unit for datetimes, hence 'datetime64[ns]'
df = df.astype({"gender": 'category', 'category': 'category',
                "size": 'float64', 'comfort': 'float64', 'durability': 'float64', 'r_date' : 'datetime64[ns]'})

print(df.shape)
print(df.dtypes)
df.describe()

(35960, 17)
id_                          int64
gender                    category
title                       object
url                         object
category                  category
price                      float64
description                 object
description_long            object
n_reviews                  float64
score                      float64
size                       float64
comfort                    float64
durability                 float64
r_title                     object
r_raiting                  float64
r_body                      object
r_date              datetime64[ns]
dtype: object


| | id_ | price | n_reviews | score | size | comfort | durability | r_raiting |
|---|---|---|---|---|---|---|---|---|
| count | 35960.0 | 35950.0 | 35519.0 | 35519.0 | 35138.0 | 35138.0 | 35138.0 | 35519.0 |
| mean | 387.172442 | 102.204256 | 438.456347 | 4.587212 | 47.497083 | 85.052365 | 73.543557 | 4.585715 |
| std | 209.56151 | 50.799157 | 582.444342 | 0.279373 | 11.031876 | 10.063183 | 13.583633 | 0.900332 |
| min | 1.0 | 21.0 | 1.0 | 1.0 | 14.5 | 25.0 | 20.0 | 1.0 |
| 25% | 193.0 | 65.0 | 57.0 | 4.5 | 41.5 | 80.0 | 68.0 | 5.0 |
| 50% | 415.0 | 90.0 | 183.0 | 4.6 | 48.5 | 86.0 | 76.0 | 5.0 |
| 75% | 552.0 | 125.0 | 652.0 | 4.8 | 54.0 | 93.0 | 83.5 | 5.0 |
| max | 739.0 | 400.0 | 2298.0 | 5.0 | 100.0 | 100.0 | 100.0 | 5.0 |


## Handling Missing Values

In [190]:
# Getting an idea of where the missing values are

df.isnull().sum()

id_                    0
gender                 0
title                  0
url                    0
category               0
price                  0
description            0
description_long     136
n_reviews            431
score                431
size                 812
comfort              812
durability           812
r_title             2309
r_raiting            431
r_body               431
r_date               431
dtype: int64

#### There are 10 items that could not be scraped correctly because the HTML formatting on the page changed. Some items also lack a long description, reviews, or a size/comfort/durability slider result.
#### We drop those 10 items because they provide no value and are a small part of the total of over 1,200 products. The other missing values are legitimate: the corresponding fields are simply absent from the product page, since not every product page includes every field.

In [189]:
# Keep only the rows that have a title
df = df[df['title'].notna()]
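A minimal sketch of the drop step on toy data, showing how the number of affected products (rather than rows) could be checked before filtering, and that other missing fields are left untouched. The toy frame is a hypothetical stand-in; only the column names `id_`, `title`, and `r_body` come from the dataset above.

```python
import pandas as pd

# Toy frame: one row per review, so a product id can repeat (values are made up)
toy = pd.DataFrame({
    "id_": [1, 1, 2, 3],
    "title": ["Shoe A", "Shoe A", None, "Shoe C"],
    "r_body": ["Great", "Too narrow", None, None],
})

# Count distinct products, not rows, whose title is missing
n_bad_products = toy.loc[toy["title"].isna(), "id_"].nunique()

# Keep only rows with a title; other missing fields stay as they are
cleaned = toy[toy["title"].notna()]

print(n_bad_products)
print(cleaned.isnull().sum())
```

Counting unique `id_` values matters here because each product appears once per review, so the row count alone would overstate how many products are dropped.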