# <u>Data Cleansing</u>
* In this part we will look for duplicated data, missing values and outliers.

In [1]:
# ~~~ Imports ~~~
import requests
import json
import bs4
from bs4 import BeautifulSoup  
import pandas as pd
import scipy as sc
import numpy as np

In [2]:
# ~~~ Load csv function ~~~
def load_csv(file_name):
    df = pd.read_csv(file_name)
    return df.drop(columns = 'Unnamed: 0').copy()

In [3]:
# ~~~ Load data frames from csv files ~~~
df = load_csv('Virus Full DataFrame.csv')
df_numerical = load_csv('Virus Full DataFrame Numerical.csv')

### <u>Duplicated Data</u>

In [4]:
print('Number of duplicated rows: ', df.duplicated().sum())
df[df.duplicated()]

Number of duplicated rows:  0


Unnamed: 0,virus name,species,genus,family,Host plasticity No of species,Host plasticity No of species Impact,Host plasticity No of orders,Host plasticity No of orders Impact,Geography of the host,Geography of the host Impact,...,Proportion of known human pathogens in the viral family Impact,Transmission mode of the virus,Transmission mode of the virus Impact,Animal to human transmission,Animal to human transmission Impact,Human to human transmission,Human to human transmission Impact,Duration of virus species infection in humans,Duration of virus species infection in humans Impact,Risk Score


* <b>Our data frame doesn't have duplicates.</b>
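
Called without arguments, `df.duplicated()` flags only rows that repeat an earlier row in every column. The sketch below uses a hypothetical toy frame (its values are invented for illustration and are not taken from the virus data) to show how the `subset` and `keep` parameters could narrow the check to selected columns:

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    'virus name': ['A-1', 'A-2', 'B-1'],
    'species': ['Alpha', 'Alpha', 'Beta'],
    'score': [10, 12, 7],
})

# Full-row check: counts rows that repeat an earlier row in every column.
full_row_duplicates = toy.duplicated().sum()

# Column-restricted check: keep=False marks every member of a repeated group.
same_species = toy[toy.duplicated(subset=['species'], keep=False)]

print('Fully duplicated rows:', full_row_duplicates)
print('Rows sharing a species:', len(same_species))
```

A column-restricted check of this kind may be worth considering when several rows are expected to share a value such as the species.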

### <u>Missing Values</u>

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 67 columns):
 #   Column                                                                                                       Non-Null Count  Dtype 
---  ------                                                                                                       --------------  ----- 
 0   virus name                                                                                                   887 non-null    object
 1   species                                                                                                      887 non-null    object
 2   genus                                                                                                        887 non-null    object
 3   family                                                                                                       887 non-null    object
 4   Host plasticity No of species                                       
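
The `df.info()` listing is long for a 67-column frame. A more compact check, sketched here on a hypothetical toy frame (invented values, not the virus data), is to sum `isna()` per column. Note that `isna()` only recognises real missing markers such as `NaN` or `None`; a placeholder string like 'Unassigned' is not counted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one genuine missing value and one placeholder string.
toy = pd.DataFrame({
    'genus': ['GenusA', 'Unassigned', np.nan],
    'score': [3, 5, 1],
})

# isna() flags NaN/None only; the string 'Unassigned' is treated as a regular value.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print('Total missing values:', total_missing)
```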

* <b>Our data frame doesn't have missing values.</b>
* <b>Although there are no missing values, we found that the 'genus' column contains the placeholder value 'Unassigned':</b>

In [6]:
df['virus name'].describe()

count                           887
unique                          887
top       ADENOVIRUS PREDICT_ADV-26
freq                              1
Name: virus name, dtype: object

In [7]:
df['species'].describe()

count                                                   887
unique                                                  874
top       Severe acute respiratory syndrome-related coro...
freq                                                      5
Name: species, dtype: object

In [8]:
df['genus'].describe()

count            887
unique            31
top       Unassigned
freq             304
Name: genus, dtype: object

In [9]:
df['family'].describe()

count              887
unique              19
top       Astroviridae
freq               235
Name: family, dtype: object

- <b>As we can see, the 'genus' column contains 304 Unassigned values. We therefore counted the Unassigned values in each of the following columns: 'virus name', 'species', 'genus', 'family'.</b>

In [10]:
virus_name = df['virus name']
species = df['species']
genus = df['genus']
family = df['family']

# Count the 'Unassigned' entries in each taxonomy column.
columns = {'Virus name': virus_name, 'Species': species, 'Genus': genus, 'Family': family}
for label, values in columns.items():
    count = (values == 'Unassigned').sum()  # Vectorized comparison instead of a manual loop.
    print(label + ' Unassigned values: ' + str(count))

Virus name Unassigned values: 0
Species Unassigned values: 0
Genus Unassigned values: 304
Family Unassigned values: 0


- <b>We did not find any other Unassigned values. As for the Unassigned values in the 'genus' column, we decided to leave them as they are.</b>
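
For reference, the same kind of count can be taken over several columns at once with `DataFrame.eq`. This is a minimal sketch on a hypothetical toy frame (invented values); the share of affected rows is added only as an illustration of how the size of the issue could be judged.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    'species': ['Alpha', 'Beta', 'Gamma'],
    'genus': ['Unassigned', 'GenusA', 'Unassigned'],
    'score': [3, 5, 1],
})

# Compare every cell of the selected text columns with the placeholder, then count per column.
unassigned_counts = toy[['species', 'genus']].eq('Unassigned').sum()

# Fraction of rows carrying the placeholder in each column.
unassigned_share = unassigned_counts / len(toy)

print(unassigned_counts)
print(unassigned_share)
```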

### <u>Outliers</u>
* In this part we only verify that the values in the numerical data frame's columns are valid and in range.
  In the numerical data frame, every risk factor takes values between 1 and 5, every risk factor impact takes values between 1 and 3,
  and the risk score takes values between 0 and 155.
  We will check whether any values fall above or below these limits.
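
A range check of this kind can be sketched with `Series.between`, which is inclusive on both ends by default. The frame, column names and bounds dictionary below are hypothetical and only mirror the three ranges described above; the toy data deliberately contains one out-of-range impact value.

```python
import pandas as pd

# Hypothetical toy data; 'Factor A Impact' contains one out-of-range value (4).
toy = pd.DataFrame({
    'Factor A': [1, 5, 3],
    'Factor A Impact': [3, 3, 4],
    'Risk Score': [0, 155, 80],
})

# Allowed (lower, upper) bounds per column, both inclusive.
bounds = {
    'Factor A': (1, 5),
    'Factor A Impact': (1, 3),
    'Risk Score': (0, 155),
}

# between() marks in-range values; ~ inverts the mask to count violations.
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)
```

Keeping a per-column count, rather than a single total, makes it easier to see which column a violation comes from.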

In [11]:
for index, col in enumerate(df_numerical.columns):
    print(index, col)

0 virus name
1 species
2 genus
3 family
4 Host plasticity No of species
5 Host plasticity No of species Impact
6 Host plasticity No of orders
7 Host plasticity No of orders Impact
8 Geography of the host
9 Geography of the host Impact
10 Number of primary high risk disease transmission interfaces where the virus has been detected
11 Number of primary high risk disease transmission interfaces where the virus has been detected Impact
12 Frequency of interaction between domestic animals and humans in the host ecosystem
13 Frequency of interaction between domestic animals and humans in the host ecosystem Impact
14 Intimacy of interaction between domestic animals and humans in the host ecosystem
15 Intimacy of interaction between domestic animals and humans in the host ecosystem Impact
16 Frequency of interaction between wild animals and humans in the host ecosystem
17 Frequency of interaction between wild animals and humans in the host ecosystem Impact
18 Intimacy of interaction between wi

- First, we will look at the unique values of the risk factor and risk factor impact columns:

In [12]:
for col_num in range(4, 66):
    print(df_numerical[df_numerical.columns[col_num]].unique())

[3 5 4 1 2]
[3]
[1 4 3 2]
[3]
[4 5 3 2]
[3]
[1 3 2]
[3]
[5 1 3]
[3]
[5 1 3]
[3]
[3 5 1]
[3]
[5 3 1]
[3]
[3 5 4]
[3]
[5 4 3 2 1]
[2]
[5 1 3]
[2]
[4 5 3 1 2]
[3]
[5 1 3]
[2]
[5 1 3]
[2]
[5 1 3]
[2]
[5 3]
[2]
[3 5]
[1]
[5 3]
[2]
[5 1]
[3]
[5]
[3]
[3 2 5 4]
[3]
[5 4 3 2 1]
[2]
[3 2 4 1]
[2]
[4 1 5 3]
[3]
[3 5 4 2 1]
[2]
[1 5]
[3]
[2 4 3 1]
[3]
[1 2 3]
[3]
[5 1]
[3]
[5 1]
[3]
[3 5 1]
[2]


- No exceptional values are visible. Next, we will check that all values are in range:

In [13]:
total_invalid = 0
for i in range(4, 66):  # Risk factors sit at even column indices, their impacts at odd ones.
    col_name = df_numerical.columns[i]
    upper = 5 if i % 2 == 0 else 3  # Risk factors range over 1-5, impacts over 1-3.
    # Accumulate the number of values below 1 or above the upper limit.
    total_invalid += ((df_numerical[col_name] < 1) | (df_numerical[col_name] > upper)).sum()

# Check the risk score column against its 0-155 range.
total_invalid += ((df_numerical['Risk Score'] < 0) | (df_numerical['Risk Score'] > 155)).sum()

print('Total Invalid values: ', total_invalid)

Total Invalid values:  0


* <b>All values are valid and in range :)</b>