# Data preparation 3

This notebook finishes analyzing the first half of the dataset. In this part, we look at the numerical columns that might contain incorrect values. The analysis was split into two parts because the first part took a long time to execute. We continue with the dataset in result_1.csv, generated at the end of dataset_preparing_1.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [2]:
data_1 = pd.read_csv("result_1.csv", encoding="iso-8859-1", low_memory=False)

In [3]:
list(data_1.columns)[:51]

['Unnamed: 0',
 'eventid',
 'iyear',
 'imonth',
 'iday',
 'approxdate',
 'extended',
 'resolution',
 'country_txt',
 'region_txt',
 'provstate',
 'city',
 'latitude',
 'longitude',
 'specificity',
 'vicinity',
 'crit1',
 'crit2',
 'crit3',
 'doubtterr',
 'alternative_txt',
 'multiple',
 'success',
 'suicide',
 'attacktype1_txt',
 'attacktype2_txt',
 'attacktype3_txt',
 'targtype1_txt',
 'targsubtype1_txt',
 'corp1',
 'target1',
 'natlty1_txt',
 'targtype2_txt',
 'targsubtype2_txt',
 'corp2',
 'target2',
 'natlty2_txt',
 'targtype3_txt',
 'targsubtype3_txt',
 'corp3',
 'target3',
 'natlty3_txt',
 'gname',
 'gsubname',
 'gname2',
 'gsubname2',
 'gname3',
 'gsubname3',
 'motive',
 'guncertain1',
 'guncertain2']

### extended, resolution, latitude, longitude

In [4]:
data_1.extended.value_counts()

0    173452
1      8239
Name: extended, dtype: int64

Looking at the value counts, this column seems to have no errors.

We don't plan to use the resolution date in our analysis, so we discard this column.

In [5]:
data_1 = data_1.drop('resolution', axis=1)

Latitude and longitude have missing values.

In [6]:
data_1.latitude.isna().sum()

4556

In [7]:
data_1.longitude.isna().sum()

4557

This is expected because latitude and longitude only exist where the incidents occurred in cities.
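The two counts above differ by one (4556 versus 4557), so at least one row has a latitude but no longitude. A minimal sketch of how such rows could be located, using a small made-up frame (`toy`) in place of `data_1`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_1; the values are made up for illustration.
toy = pd.DataFrame({
    "latitude": [33.3, np.nan, 41.9, 10.0],
    "longitude": [44.4, np.nan, np.nan, 20.0],
})

# True where exactly one of the two coordinates is missing (XOR of the masks).
mismatch = toy["latitude"].isna() ^ toy["longitude"].isna()
n_mismatch = int(mismatch.sum())

print(n_mismatch)
print(toy[mismatch])
```

Applied to the real data, the same mask would show which incidents have only half of a coordinate pair.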

### specificity, vicinity

In [8]:
data_1.specificity.value_counts()

1.0    144996
3.0     14615
2.0      8990
4.0      8534
5.0      4550
Name: specificity, dtype: int64

In [9]:
data_1.vicinity.value_counts()

 0    168932
 1     12724
-9        35
Name: vicinity, dtype: int64

In [10]:
# Replace the odd -9 value with np.nan
data_1.loc[data_1.vicinity == -9, "vicinity"] = np.nan
data_1.vicinity.value_counts()

0.0    168932
1.0     12724
Name: vicinity, dtype: int64
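Note that the values now print as 0.0 and 1.0: a plain integer column cannot hold NaN, so pandas upcasts it to float. If integer values are preferred, one possible alternative is the nullable `Int64` dtype. The sketch below uses a made-up series (`toy`) and is not what this notebook does:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the vicinity column, including the -9 code.
toy = pd.Series([0, 1, -9, 0], name="vicinity")

# Replacing -9 with NaN upcasts the integer column to float64.
as_float = toy.replace(-9, np.nan)

# The nullable "Int64" dtype keeps integers and stores missing values as <NA>.
as_nullable = toy.astype("Int64").mask(toy == -9)

print(as_float.dtype)     # float64
print(as_nullable.dtype)  # Int64
```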

### crit1, crit2, crit3

In [11]:
data_1.crit1.value_counts()

1    179607
0      2084
Name: crit1, dtype: int64

In [12]:
data_1.crit2.value_counts()

1    180436
0      1255
Name: crit2, dtype: int64

In [13]:
data_1.crit3.value_counts()

1    159101
0     22590
Name: crit3, dtype: int64

### doubtterr, alternative_txt, multiple, success, suicide

In [14]:
data_1.doubtterr.value_counts()

 0.0    138905
 1.0     29001
-9.0     13784
Name: doubtterr, dtype: int64

In [15]:
# Replace the odd -9 value with np.nan
data_1.loc[data_1.doubtterr == -9, "doubtterr"] = np.nan
data_1.doubtterr.value_counts()

0.0    138905
1.0     29001
Name: doubtterr, dtype: int64
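The same -9 replacement has now been applied to both vicinity and doubtterr. If more columns turn out to need it, the step could be wrapped in a small helper. The function name `sentinel_to_nan` and the frame `toy` below are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd

def sentinel_to_nan(df, columns, sentinel=-9):
    """Return a copy of df with `sentinel` replaced by NaN in `columns`."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].replace(sentinel, np.nan)
    return out

# Toy stand-in for data_1; iyear is included to show it is left untouched.
toy = pd.DataFrame({
    "vicinity": [0, 1, -9, 0],
    "doubtterr": [0.0, -9.0, 1.0, -9.0],
    "iyear": [1970, 1971, 1972, 1973],
})

cleaned = sentinel_to_nan(toy, ["vicinity", "doubtterr"])
print(cleaned)
```

Returning a copy keeps the original frame available for comparison, at the cost of extra memory on a large dataset.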

In [16]:
data_1.alternative_txt.value_counts()

Insurgency/Guerilla Action    23410
Other Crime Type               3665
Intra/Inter-group Conflict     1296
State Actors                    321
Lack of Intentionality          319
Name: alternative_txt, dtype: int64

In [17]:
data_1.multiple.value_counts()

0.0    156658
1.0     25032
Name: multiple, dtype: int64

In [18]:
data_1.success.value_counts()

1    161632
0     20059
Name: success, dtype: int64

In [19]:
data_1.suicide.value_counts()

0    175058
1      6633
Name: suicide, dtype: int64

### guncertain1, guncertain2

In [20]:
data_1.guncertain1.value_counts()

0.0    166545
1.0     14766
Name: guncertain1, dtype: int64

In [21]:
data_1.guncertain2.value_counts()

0.0    1436
1.0     519
Name: guncertain2, dtype: int64
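By default, `value_counts` leaves out missing values, which is why the guncertain2 counts (1436 + 519 = 1955) cover far fewer rows than the 181691 counted for extended. Passing `dropna=False` makes the missing rows visible. A minimal sketch on a made-up series (`toy`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for guncertain2: mostly missing, a few 0/1 flags.
toy = pd.Series([0.0, 1.0, np.nan, np.nan, np.nan, 0.0], name="guncertain2")

# dropna=False adds a NaN row to the counts.
with_missing = toy.value_counts(dropna=False)
n_missing = int(toy.isna().sum())

print(with_missing)
print(n_missing)
```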

### Export

We have looked through all the numerical columns, so we export this dataset to a new CSV file.

In [22]:
data_1.to_csv("result_1_2.csv")
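With its default settings, `to_csv` also writes the row index, and reading the file back produces a column named 'Unnamed: 0' like the one in the column list at the top of this notebook. If that column is not wanted, `index=False` could be passed on export. The sketch below shows the difference on a made-up frame (`toy`), writing to in-memory buffers instead of files:

```python
import io
import pandas as pd

# Toy stand-in for data_1.
toy = pd.DataFrame({"eventid": [1, 2], "extended": [0, 1]})

# Default export: the index is written as an extra, unnamed first column.
buf_default = io.StringIO()
toy.to_csv(buf_default)
buf_default.seek(0)
cols_default = list(pd.read_csv(buf_default).columns)

# index=False: only the data columns are written.
buf_clean = io.StringIO()
toy.to_csv(buf_clean, index=False)
buf_clean.seek(0)
cols_clean = list(pd.read_csv(buf_clean).columns)

print(cols_default)
print(cols_clean)
```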