## EDA: Null values

In [1]:
import pandas as pd

### Reading our data (.csv file) into a DataFrame

In [5]:
scraped_data_bt = pd.read_csv('scraped_csv_one.csv', parse_dates=True, engine='python')
scraped_data_bt.head()

Unnamed: 0,content,location,date
0,Bt are awful - terrible communication. I wante...,TOTNES,2021-11-22
1,Very poor service inspite of being with BT for...,,
2,Absolutely rubbish customer service,wakefield,2021-11-19
3,Shambles of a company. I couldn`t use the Broa...,Stockport,2021-11-14
4,The service they provide is not close to that ...,Kirkcaldy,2021-11-14
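
A note on `parse_dates=True`: per the pandas documentation, a boolean `True` asks `read_csv` to try parsing the index only, so the `date` column is most likely still loaded as strings here. A minimal sketch of parsing it explicitly, using a small in-memory CSV as a hypothetical stand-in for `scraped_csv_one.csv` (the sample rows are illustrative only):

```python
import io
import pandas as pd

# Hypothetical stand-in for scraped_csv_one.csv, with the same three columns.
sample_csv = io.StringIO(
    "content,location,date\n"
    "Absolutely rubbish customer service,wakefield,2021-11-19\n"
    "Very poor service,,\n"
)

# Name the column explicitly so pandas parses it as datetime.
sample = pd.read_csv(sample_csv, parse_dates=['date'])

# Missing dates become NaT once the column has a datetime dtype.
print(sample.dtypes)
```

Whether to parse dates at load time or later with `pd.to_datetime` is a matter of preference; the null checks further down behave the same either way.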


### Let's drop rows with null values in all columns

(We observe that no row contains null values in all columns.)

In [3]:
scraped_data_bt.dropna(axis=0, how='all')

Unnamed: 0,content,location,date
0,Bt are awful - terrible communication. I wante...,TOTNES,2021-11-22
1,Very poor service inspite of being with BT for...,,
2,Absolutely rubbish customer service,wakefield,2021-11-19
3,Shambles of a company. I couldn`t use the Broa...,Stockport,2021-11-14
4,The service they provide is not close to that ...,Kirkcaldy,2021-11-14
...,...,...,...
6026,Almost impossible to contact BT when problems ...,Leek,2009-07-29
6027,Nothing but trouble - needed new land line ins...,"Egremont, Cumbria",2009-07-28
6028,I use to get Just over 6 at a previous address...,Barnstaple Devon,2009-07-28
6029,Shocked and disappointed by the results of the...,Rosyth,2009-07-28
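
Since `dropna` returns a new DataFrame and the result above is not assigned, `scraped_data_bt` itself is unchanged by that cell. One way to confirm the observation directly is to count the rows in which every column is null. A minimal sketch on a toy frame (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: the last row is null in every column.
toy = pd.DataFrame({
    'content': ['good', 'bad', np.nan],
    'location': ['Leek', np.nan, np.nan],
    'date': ['2009-07-29', np.nan, np.nan],
})

# True for rows in which every column is null.
all_null_rows = toy.isna().all(axis=1)
n_all_null = int(all_null_rows.sum())

# dropna(how='all') removes exactly those rows and returns a new frame.
cleaned = toy.dropna(axis=0, how='all')
print(n_all_null, len(cleaned))
```

A count of zero on the real data would match the observation that no fully null rows exist.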


### Let's check for duplicate values in our dataframe

In [4]:
scraped_data_bt[scraped_data_bt.duplicated()].count()

content     118
location     54
date         54
dtype: int64
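
Note that `.count()` reports non-null cells per column, which is why the three numbers differ: 118 rows are flagged as duplicates, and only 54 of them carry a location and date. If only the number of duplicate rows is needed, summing the boolean mask is more direct. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 3 repeat rows 0 and 2 (NaN cells compare as equal here).
toy = pd.DataFrame({
    'content': ['a', 'a', 'b', 'b', 'c'],
    'location': ['Leek', 'Leek', np.nan, np.nan, 'Rosyth'],
    'date': ['2009-07-29', '2009-07-29', np.nan, np.nan, '2009-07-28'],
})

# Marks every repeat after the first occurrence of a row.
dup_mask = toy.duplicated()
n_duplicate_rows = int(dup_mask.sum())

# count() on the duplicate rows counts non-null cells column by column.
per_column = toy[dup_mask].count()
print(n_duplicate_rows)
print(per_column)
```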

<div class="alert alert-block alert-warning">
On further investigation, we found two reasons for the duplication:
    
1. Duplicates exist in the data source (https://www.broadband.co.uk/broadband/providers/bt/#reviewers)
2. Our web scraping code was duplicating the data from pages 1-4. 
</div>

<div class="alert alert-block alert-info">
Actions taken:
    
1. Duplicates that exist in the data source were deleted.
2. The web scraping code was rectified.
</div>

In [6]:
bt_data = scraped_data_bt.drop_duplicates()
bt_data

Unnamed: 0,content,location,date
0,Bt are awful - terrible communication. I wante...,TOTNES,2021-11-22
1,Very poor service inspite of being with BT for...,,
2,Absolutely rubbish customer service,wakefield,2021-11-19
3,Shambles of a company. I couldn`t use the Broa...,Stockport,2021-11-14
4,The service they provide is not close to that ...,Kirkcaldy,2021-11-14
...,...,...,...
6026,Almost impossible to contact BT when problems ...,Leek,2009-07-29
6027,Nothing but trouble - needed new land line ins...,"Egremont, Cumbria",2009-07-28
6028,I use to get Just over 6 at a previous address...,Barnstaple Devon,2009-07-28
6029,Shocked and disappointed by the results of the...,Rosyth,2009-07-28


### Let's check for null values in our dataframe

In [7]:
bt_data.isnull().sum()

content        3
location    2796
date        2739
dtype: int64

<div class="alert alert-block alert-warning">
Observations/reasons for null values:
    
1. The content column has 3 null rows, which are null in the data source as well.
2. The location and date columns have a large number of null values. The scraping code had a bug that split reviews on paragraph breaks (or \n), so null values were inserted into the location and date columns (2739 rows).
3. The remaining 57 null values in the location column (2796 - 2739) are missing from our data source as well
    (https://www.broadband.co.uk/broadband/providers/bt/#reviewers).
</div>
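
The figures in the observation can be checked with boolean masks: rows missing both location and date match the pattern left by the scraper bug, while rows missing only the location correspond to gaps in the source. A minimal sketch on made-up data; the variable names are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy data: row 1 mimics a split paragraph, row 2 a review with no location.
toy = pd.DataFrame({
    'content': ['r1', 'r1 continued', 'r2', 'r3'],
    'location': ['Leek', np.nan, np.nan, 'Rosyth'],
    'date': ['2009-07-29', np.nan, '2009-07-28', '2009-07-28'],
})

loc_null = toy['location'].isna()
date_null = toy['date'].isna()

# Both fields missing: the pattern produced by the paragraph-splitting bug.
both_null = int((loc_null & date_null).sum())

# Location missing but date present: missing at the source.
location_only_null = int((loc_null & ~date_null).sum())

print(both_null, location_only_null)
```

On the real data, the two counts would be expected to add up to the total number of null locations.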

<div class="alert alert-block alert-info">
Actions taken:
    
1. The web scraping code was rectified to avoid inserting null values into the location and date columns.
2. The remaining 57 null values in the location column were replaced with 'UK', since all our reviews are from the UK.
</div>

In [9]:
bool_series_loc = pd.isnull(bt_data['location'])
bool_series_date = pd.isnull(bt_data['date'])

# Rows with a date but no location; .copy() avoids SettingWithCopyWarning
replace_null_loc = bt_data[bool_series_loc & ~bool_series_date].copy()
replace_null_loc['location'] = replace_null_loc['location'].fillna('UK')
replace_null_loc


Unnamed: 0,content,location,date
473,Terrible on all levels. Had a multiple problem...,UK,2020-10-24
580,Utterly abysmal! Constant disconections never...,UK,2020-05-13
643,Cant even open a web page even when no one els...,UK,2020-01-25
721,On top of that the complaints procedure is obv...,UK,2019-10-06
829,This didn't happen. Weeks and many phone call...,UK,2019-06-07
869,I do not recommend BT they promised me a £39.9...,UK,2019-04-27
1014,We have the Infinity package which averages 0....,UK,2018-12-21
1043,Following resetting of an e-mail password by a...,UK,2018-12-03
1136,BT business i would never recommend this compa...,UK,2018-10-03
1143,Every 5 minutes the internet would fail and ki...,UK,2018-09-30


In [10]:
# Merge the filled values back into the original dataset.
# combine_first keeps bt_data's values and fills its nulls from replace_null_loc.

bt_data = bt_data.combine_first(replace_null_loc)
bt_data

Unnamed: 0,content,location,date
0,Bt are awful - terrible communication. I wante...,TOTNES,2021-11-22
1,Very poor service inspite of being with BT for...,,
2,Absolutely rubbish customer service,wakefield,2021-11-19
3,Shambles of a company. I couldn`t use the Broa...,Stockport,2021-11-14
4,The service they provide is not close to that ...,Kirkcaldy,2021-11-14
...,...,...,...
6026,Almost impossible to contact BT when problems ...,Leek,2009-07-29
6027,Nothing but trouble - needed new land line ins...,"Egremont, Cumbria",2009-07-28
6028,I use to get Just over 6 at a previous address...,Barnstaple Devon,2009-07-28
6029,Shocked and disappointed by the results of the...,Rosyth,2009-07-28
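
The same replacement can also be done in one step with a boolean mask and `.loc`, which writes to the original frame directly and so needs neither the intermediate slice nor the `combine_first` merge. A sketch on made-up data, assuming the same column names as `bt_data`:

```python
import numpy as np
import pandas as pd

# Toy data: row 1 lacks both fields, row 2 lacks only the location.
toy = pd.DataFrame({
    'content': ['r1', 'r1 continued', 'r2'],
    'location': ['Leek', np.nan, np.nan],
    'date': ['2009-07-29', np.nan, '2009-07-28'],
})

# Rows with a date but no location: missing at the source, not a scraper artifact.
mask = toy['location'].isna() & toy['date'].notna()

# Assign through .loc so the change lands on toy itself, not on a copy.
toy.loc[mask, 'location'] = 'UK'
print(toy)
```

Rows that are missing both fields are left untouched, which mirrors the behaviour of the two-step approach above.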
