## Exploratory Data Analysis (EDA)

The initial step of EDA (everyone has their own sequence of steps) is managing or "wrangling" the data: preprocessing, organising, and summarising it.

Preprocessing can involve "cleaning" tasks such as eliminating rows or columns with nulls (in Python, missing numeric values usually appear as NaN, "Not a Number"); sometimes a "meaningful" value is substituted for the "blank" instead of simply removing it. It can also involve "one-hot encoding" (sometimes referred to by the initialism OHE), which changes categorical data into binary indicator columns so that it can be included in computations (primarily for ML), "rescaling" numbers to improve the effectiveness of later steps, or other "transformations" of the data that aid processing or avoid errors down the line.

Organising covers things like reducing the dataset to certain rows or columns (of course, there needs to be a "method to the madness"), "subdividing" it into more "manageable chunks" for purposes such as viewing (often printouts), or practicalities such as "reindexing" (in case index values are missing, columns need to be added, or other changes are dictated by how the data is planned to be used).

Summarising may comprise Descriptive Statistics (in Python, a method like *describe* is often used once the data has been loaded into a DataFrame) to get measures of central tendency or variability (spread), visualisations, or some combination of these to try to understand the data. At times, it may be useful to look at the distribution (i.e. count and frequency, shown visually through a histogram) or the range (i.e. minimum, maximum, and quartiles around the median, shown visually through a box plot, also called a box-and-whisker diagram). Refer to https://github.com/LinsAbadia/Python/blob/master/Statistics/Descriptive.ipynb for more information on producing summaries with Descriptive Statistics.
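
The cleaning and transformation tasks described above can be sketched roughly as follows. This is only an illustration on a made-up toy DataFrame: the column names `height` and `colour`, and the choice of the column mean as the substitute value, are assumptions and do not come from the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a gap and one categorical column
raw = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 180.0],
    "colour": ["red", "green", "red", "blue"],
})

# Cleaning, option 1: drop every row that contains a NaN
dropped = raw.dropna()

# Cleaning, option 2: substitute the column mean for the missing value
filled = raw.fillna({"height": raw["height"].mean()})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(filled, columns=["colour"])

# Min-max rescaling of the numeric column to the range [0, 1]
h = encoded["height"]
encoded["height_scaled"] = (h - h.min()) / (h.max() - h.min())

print(encoded)
```

Whether to drop or fill missing values depends on the question being asked; the mean is just one possible substitute.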

Generally, EDA is performed to help answer a question of interest. More often than not, this is a research question in academic circles.

This is partly based on the Wesleyan Coursera Week 2/3 module on Data Management and Visualization (https://www.coursera.org/learn/data-visualization/home/week/2).

In [55]:
import pandas as pd
import numpy as np

np.random.seed(123) #I am using seed so that the random generator always produces the same results and you can see similar outputs

dataset = pd.DataFrame(np.random.randint(0,100,size=(1024, 3)), columns= list('abc')) #we generate random numbers so we have a DataFrame

print(dataset)


       a   b   c
0     66  92  98
1     17  83  57
2     86  97  96
3     47  73  32
4     46  96  25
...   ..  ..  ..
1019  35  70  65
1020  65  84  75
1021  11  41   3
1022  35  54  38
1023   4  54  25

[1024 rows x 3 columns]


In [56]:
dataset.describe()

               a          b          c
count     1024.0     1024.0     1024.0
mean   48.681641  49.978516  49.041016
std    29.409407  29.162295  28.976889
min          0.0        0.0        0.0
25%         23.0       24.0       23.0
50%         48.0       50.0       48.5
75%        74.25      76.25       75.0
max         99.0       99.0       99.0
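
The figures that a box plot and a histogram are drawn from can also be computed directly. The following is a minimal sketch on a stand-in Series (`sample` is hypothetical and is not the same data as `dataset`); the 1.5 x IQR whisker bounds are a common convention and an assumption here, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
# Hypothetical stand-in for a single numeric column
sample = pd.Series(np.random.randint(0, 100, size=1024), name="a")

# Quartiles: the numbers behind a box plot
q1, median, q3 = sample.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, a measure of spread

# A common (assumed) convention places the whisker limits at 1.5 * IQR beyond the box
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Bin counts: the numbers behind a histogram with 10 equal-width bins
counts, edges = np.histogram(sample, bins=10, range=(0, 100))

print(q1, median, q3, iqr)
print(counts)
```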


In [57]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024 entries, 0 to 1023
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       1024 non-null   int64
 1   b       1024 non-null   int64
 2   c       1024 non-null   int64
dtypes: int64(3)
memory usage: 24.1 KB


In [58]:
data = dataset[dataset['a'] > dataset['a'].mean()].copy() #keep only the rows where 'a' is above its mean; .copy() makes data independent of dataset

In [59]:
data.describe()

               a          b          c
count      508.0      508.0      508.0
mean   74.521654  51.403543  48.822835
std    14.972214  29.037019  29.871588
min         49.0        0.0        0.0
25%        61.75       26.0       22.0
50%         75.0       51.0       48.0
75%         87.0       78.0       75.0
max         99.0       99.0       99.0


Patterns are sometimes "obvious" enough to spot visually, but when there are "a lot" of observations you'll need to leverage tools (e.g. spreadsheets, statistical packages, or your own programs), as machines can detect what can't be observed by the "naked" human eye. To reduce complexity (as well as execution time), it is often necessary to remove some columns (i.e. features) that appear to contribute little "value-added". This, in effect, increases the "weight" of the "important" features. Moreover, for the sake of readability, you might rename a column, as "data dictionaries/codebooks" may not be easily accessible. Feature engineering tasks such as these can eventually help with the developed model's "performance".
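
As a rough sketch of removing a low-value column and renaming the rest, consider the example below. The frame `df` is hypothetical, and treating a constant column as "little value-added" is only one possible criterion, assumed here for illustration.

```python
import pandas as pd

# Hypothetical frame; the column names and the drop criterion are for illustration only
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 7, 7]})

# A constant column cannot distinguish one row from another, so it is a candidate to drop
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
reduced = df.drop(columns=constant_cols)

# rename returns a new DataFrame; the original df is left untouched
renamed = reduced.rename(columns={"a": "Var A", "b": "Var B"})

print(constant_cols)
print(list(renamed.columns))
```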

In [60]:
data = data.rename(columns={'a':'Var A', 'b':'Var B', 'c' : 'Var C'}) #assigning the result (rather than using inplace = True) avoids a SettingWithCopyWarning when data is a filtered slice

print(data.describe())

            Var A       Var B       Var C
count  508.000000  508.000000  508.000000
mean    74.521654   51.403543   48.822835
std     14.972214   29.037019   29.871588
min     49.000000    0.000000    0.000000
25%     61.750000   26.000000   22.000000
50%     75.000000   51.000000   48.000000
75%     87.000000   78.000000   75.000000
max     99.000000   99.000000   99.000000


In [61]:
dataA = data['Var A']

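dataA.reindex() #with no arguments this returns the Series unchanged and the result is discarded, so dataA keeps the index inherited from the DataFrame; dataA.reset_index(drop=True) would renumber it from 0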

dataA.describe()

count    508.000000
mean      74.521654
std       14.972214
min       49.000000
25%       61.750000
50%       75.000000
75%       87.000000
max       99.000000
Name: Var A, dtype: float64
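
To show how a filtered Series keeps the row labels of its parent DataFrame, and how `reset_index(drop=True)` renumbers them, here is a small sketch. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the default index 0..3
frame = pd.DataFrame({"Var A": [10, 60, 20, 70]})

# Filtering keeps the original row labels of the rows that survive
subset = frame[frame["Var A"] > 50]["Var A"]
print(list(subset.index))       # [1, 3]

# reset_index(drop=True) renumbers from 0 and discards the old labels
renumbered = subset.reset_index(drop=True)
print(list(renumbered.index))   # [0, 1]
```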

In [62]:
row = {'Var A': ['string'],'Var B':['string object'], 'Var C': ['string/object']}
nonNumericRow = pd.DataFrame(row)

In [63]:
print(type(nonNumericRow), '\n' ,nonNumericRow) #verify correct data type and format

<class 'pandas.core.frame.DataFrame'> 
     Var A          Var B          Var C
0  string  string object  string/object


In [64]:
data = pd.concat([data,nonNumericRow], ignore_index = True)

In [65]:
blankRow = pd.Series(dtype = np.int64)  #dtype is given explicitly to avoid the warning about the default dtype of an empty Series

In [66]:
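pd.concat([data, blankRow.to_frame().T], ignore_index = True) #DataFrame.append was removed in pandas 2.0, so concat is used instead; the result is only displayed, not assigned, so data itself is unchanged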

      Var A          Var B          Var C
0        66             92             98
1        86             97             96
2        83             78             36
3        96             80             68
4        49             55             67
...     ...            ...            ...
505      67              4             25
506      94             17              8
507      65             84             75
508  string  string object  string/object


In [67]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509 entries, 0 to 508
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Var A   509 non-null    object
 1   Var B   509 non-null    object
 2   Var C   509 non-null    object
dtypes: object(3)
memory usage: 12.1+ KB


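NOTE: Observe the non-null count: all 509 rows are non-null because the result of adding the blank row was displayed but never assigned back to data. The string row has also changed the Dtype of every column from int64 to object.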
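
The effect of a non-numeric row on the dtype, and of a genuinely blank row on the non-null count, can be reproduced on a small scale. This is a sketch with made-up values; using `reindex` to add one empty row is an assumed shortcut, not the approach taken in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row numeric frame
numeric = pd.DataFrame({"Var A": [66, 86]})
print(numeric["Var A"].dtype)    # int64

# Adding a row that holds a string forces the whole column to the object dtype
text_row = pd.DataFrame({"Var A": ["string"]})
mixed = pd.concat([numeric, text_row], ignore_index=True)
print(mixed["Var A"].dtype)      # object

# Extending the index by one label adds a row filled with NaN
with_blank = mixed.reindex(index=range(len(mixed) + 1))
print(with_blank["Var A"].isna().sum())   # 1 missing value
print(with_blank["Var A"].count())        # 3 non-null values
```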

EDA may involve Univariate Analysis: examining the characteristics of a single variable (i.e. column or feature) on its own. The distribution of the variable can be represented by a frequency table showing counts and percentages.

In [68]:
data['Var A'].value_counts(sort=False) #frequency counts; sort=False stops the values from being ordered by their number of occurrences

string     1
49        10
50        13
51         8
52         8
53        12
54        10
55         9
56        11
57         7
58        16
59        10
60         9
61         4
62         9
63        12
64         6
65         8
66         8
67         7
68        11
69        12
70         6
71        13
72        12
73         9
74        12
75         8
76         8
77        14
78        10
79         6
80        10
81        12
82         6
83        13
84        17
85         8
86        12
87         8
88         7
89        11
90         9
91        13
92         7
93         9
94        13
95         7
96         9
97        14
98        10
99        15
Name: Var A, dtype: int64

In [69]:
dataA.value_counts(sort=False)

49    10
50    13
51     8
52     8
53    12
54    10
55     9
56    11
57     7
58    16
59    10
60     9
61     4
62     9
63    12
64     6
65     8
66     8
67     7
68    11
69    12
70     6
71    13
72    12
73     9
74    12
75     8
76     8
77    14
78    10
79     6
80    10
81    12
82     6
83    13
84    17
85     8
86    12
87     8
88     7
89    11
90     9
91    13
92     7
93     9
94    13
95     7
96     9
97    14
98    10
99    15
Name: Var A, dtype: int64

In [70]:
data['Var A'].value_counts(sort=False, normalize = True) #relative frequencies as proportions between 0 and 1; multiply by 100 for percentages

string    0.001965
49        0.019646
50        0.025540
51        0.015717
52        0.015717
53        0.023576
54        0.019646
55        0.017682
56        0.021611
57        0.013752
58        0.031434
59        0.019646
60        0.017682
61        0.007859
62        0.017682
63        0.023576
64        0.011788
65        0.015717
66        0.015717
67        0.013752
68        0.021611
69        0.023576
70        0.011788
71        0.025540
72        0.023576
73        0.017682
74        0.023576
75        0.015717
76        0.015717
77        0.027505
78        0.019646
79        0.011788
80        0.019646
81        0.023576
82        0.011788
83        0.025540
84        0.033399
85        0.015717
86        0.023576
87        0.015717
88        0.013752
89        0.021611
90        0.017682
91        0.025540
92        0.013752
93        0.017682
94        0.025540
95        0.013752
96        0.017682
97        0.027505
98        0.019646
99        0.029470
Name: Var A, dtype: float64

In [71]:
dataA.value_counts(sort=False, normalize = True)

49    0.019685
50    0.025591
51    0.015748
52    0.015748
53    0.023622
54    0.019685
55    0.017717
56    0.021654
57    0.013780
58    0.031496
59    0.019685
60    0.017717
61    0.007874
62    0.017717
63    0.023622
64    0.011811
65    0.015748
66    0.015748
67    0.013780
68    0.021654
69    0.023622
70    0.011811
71    0.025591
72    0.023622
73    0.017717
74    0.023622
75    0.015748
76    0.015748
77    0.027559
78    0.019685
79    0.011811
80    0.019685
81    0.023622
82    0.011811
83    0.025591
84    0.033465
85    0.015748
86    0.023622
87    0.015748
88    0.013780
89    0.021654
90    0.017717
91    0.025591
92    0.013780
93    0.017717
94    0.025591
95    0.013780
96    0.017717
97    0.027559
98    0.019685
99    0.029528
Name: Var A, dtype: float64

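* NOTE: dataA is a Series rather than a DataFrame because selecting a single column (data['Var A']) returns a Series. It was also taken before the non-numeric row was added, so it holds only the 508 numeric values.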
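
Counts and percentages can be placed side by side to form the frequency table described earlier. The following is a minimal sketch on a short made-up Series; the names `values` and `freq_table` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values standing in for a single column
values = pd.Series([49, 50, 50, 51, 51, 51], name="Var A")

# Counts and percentages, both ordered by value rather than by frequency
counts = values.value_counts().sort_index()
percent = values.value_counts(normalize=True).sort_index() * 100

# Combine the two into one frequency table; the Series align on their shared index
freq_table = pd.DataFrame({"count": counts, "percent": percent})
print(freq_table)
```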

In [72]:
dataA.convert_dtypes(convert_integer = True) #convert to the best-fitting nullable dtype (here Int64); this does not remove non-numeric values, but a numeric dtype matters because numbers stored as strings are ordered as text (e.g. "1" is followed by "10")

0       66
2       86
5       83
6       96
7       49
        ..
1012    56
1013    89
1014    67
1015    94
1020    65
Name: Var A, Length: 508, dtype: Int64
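
The ordering problem with numbers stored as text, and one way to actually discard non-numeric entries, can be sketched as below. Using `pd.to_numeric` with `errors="coerce"` is an assumed alternative for illustration, not what the cell above does.

```python
import pandas as pd

# Hypothetical column in which numbers arrived as text alongside a stray label
as_text = pd.Series(["1", "2", "10", "string"])

# Text is ordered character by character, so "10" lands before "2"
print(sorted(["1", "2", "10"]))        # ['1', '10', '2']

# errors="coerce" turns anything that is not a number into NaN
as_numbers = pd.to_numeric(as_text, errors="coerce")

# Dropping the NaN leaves only numeric values, which now sort numerically
clean = as_numbers.dropna().astype("int64")
print(clean.sort_values().tolist())    # [1, 2, 10]
```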

In [73]:
dataA.value_counts(sort = False, dropna = True)

49    10
50    13
51     8
52     8
53    12
54    10
55     9
56    11
57     7
58    16
59    10
60     9
61     4
62     9
63    12
64     6
65     8
66     8
67     7
68    11
69    12
70     6
71    13
72    12
73     9
74    12
75     8
76     8
77    14
78    10
79     6
80    10
81    12
82     6
83    13
84    17
85     8
86    12
87     8
88     7
89    11
90     9
91    13
92     7
93     9
94    13
95     7
96     9
97    14
98    10
99    15
Name: Var A, dtype: int64

As most programming languages are flexible, there are often several ways to accomplish the same task.
For example, instead of *value_counts* you can use *groupby*.

In [74]:
data.groupby('Var A').size()

Var A
49        10
50        13
51         8
52         8
53        12
54        10
55         9
56        11
57         7
58        16
59        10
60         9
61         4
62         9
63        12
64         6
65         8
66         8
67         7
68        11
69        12
70         6
71        13
72        12
73         9
74        12
75         8
76         8
77        14
78        10
79         6
80        10
81        12
82         6
83        13
84        17
85         8
86        12
87         8
88         7
89        11
90         9
91        13
92         7
93         9
94        13
95         7
96         9
97        14
98        10
99        15
string     1
dtype: int64
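
A quick way to convince yourself that the two approaches agree is to compare their results on a small example. This sketch uses a made-up Series; both calls ignore NaN values by default.

```python
import pandas as pd

# Hypothetical values standing in for a single column
s = pd.Series([49, 50, 50, 99, 99, 99], name="Var A")

# Approach 1: value_counts, ordered by value instead of by frequency
by_value_counts = s.value_counts().sort_index()

# Approach 2: group the Series by its own values and count the members of each group
by_groupby = s.groupby(s).size()

# Both give the same value -> count mapping
same = by_value_counts.to_dict() == by_groupby.to_dict()
print(same)   # True
```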

In [79]:
print(data[data['Var A']=='string'])

      Var A          Var B          Var C
508  string  string object  string/object


In [80]:
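data['Var A'] = data['Var A'].replace(to_replace = 'string', value = np.nan) #replace the non-numeric entry with NaN; assigning the result back is more reliable than calling replace with inplace=True on a selected column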

In [81]:
print(data[data['Var A']=='string'])

Empty DataFrame
Columns: [Var A, Var B, Var C]
Index: []


In [85]:
data.dropna(inplace = True) #remove rows containing "null" values (NaN in pandas)

In [86]:
data.groupby('Var A').size() * 100 / len(data['Var A'])

Var A
49.0    1.968504
50.0    2.559055
51.0    1.574803
52.0    1.574803
53.0    2.362205
54.0    1.968504
55.0    1.771654
56.0    2.165354
57.0    1.377953
58.0    3.149606
59.0    1.968504
60.0    1.771654
61.0    0.787402
62.0    1.771654
63.0    2.362205
64.0    1.181102
65.0    1.574803
66.0    1.574803
67.0    1.377953
68.0    2.165354
69.0    2.362205
70.0    1.181102
71.0    2.559055
72.0    2.362205
73.0    1.771654
74.0    2.362205
75.0    1.574803
76.0    1.574803
77.0    2.755906
78.0    1.968504
79.0    1.181102
80.0    1.968504
81.0    2.362205
82.0    1.181102
83.0    2.559055
84.0    3.346457
85.0    1.574803
86.0    2.362205
87.0    1.574803
88.0    1.377953
89.0    2.165354
90.0    1.771654
91.0    2.559055
92.0    1.377953
93.0    1.771654
94.0    2.559055
95.0    1.377953
96.0    1.771654
97.0    2.755906
98.0    1.968504
99.0    2.952756
dtype: float64

In [84]:
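data.groupby('Var A').size() * 100 / len(data['Var A']) #the output below predates the dropna() call (note the execution counts): the denominator was 509 because it still included the NaN row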

Var A
49.0    1.964637
50.0    2.554028
51.0    1.571709
52.0    1.571709
53.0    2.357564
54.0    1.964637
55.0    1.768173
56.0    2.161100
57.0    1.375246
58.0    3.143418
59.0    1.964637
60.0    1.768173
61.0    0.785855
62.0    1.768173
63.0    2.357564
64.0    1.178782
65.0    1.571709
66.0    1.571709
67.0    1.375246
68.0    2.161100
69.0    2.357564
70.0    1.178782
71.0    2.554028
72.0    2.357564
73.0    1.768173
74.0    2.357564
75.0    1.571709
76.0    1.571709
77.0    2.750491
78.0    1.964637
79.0    1.178782
80.0    1.964637
81.0    2.357564
82.0    1.178782
83.0    2.554028
84.0    3.339882
85.0    1.571709
86.0    2.357564
87.0    1.571709
88.0    1.375246
89.0    2.161100
90.0    1.768173
91.0    2.554028
92.0    1.375246
93.0    1.768173
94.0    2.554028
95.0    1.375246
96.0    1.768173
97.0    2.750491
98.0    1.964637
99.0    2.946955
dtype: float64

Generally, this produces only "intermediate" insights, so you will probably need to iterate (that is, repeat the process as you make "refinements") through the various stages to arrive at an appropriate "final" finding.