# Exploratory Data Analysis (EDA)
We will explore the dataset to understand its characteristics and to check for missing values or anomalies. This will help us prepare the data for the prediction model.

In [20]:
import pandas as pd

The first step is to **load the data** and format it as desired. In this case, I will replace every value marked as "?" with NaN, supply the column names manually, and finally drop the unwanted BI-RADS column.

In [21]:
file = "/Users/leona/Desktop/Projeto 1/mammogram-result-prediction/data/raw/mammographic_masses.data"
col_names = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity']

# Treat "?" as a missing value and assign the column names explicitly
df = pd.read_csv(file, na_values=['?'], names=col_names)
# Drop the unwanted BI-RADS column
df = df.drop(columns="BI-RADS")
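
To make the effect of `na_values` concrete, here is a minimal sketch that runs the same loading steps on a few made-up rows held in memory. The rows are illustrative only and are not records from the dataset; the name `demo` is just for this example.

```python
import io
import pandas as pd

# Made-up rows in the same six-column layout as the raw file (not real records).
sample = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,?,5,3,0\n"
)
col_names = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity']

# "?" is parsed as NaN because it is listed in na_values.
demo = pd.read_csv(sample, na_values=['?'], names=col_names)
demo = demo.drop(columns="BI-RADS")

print(demo.isnull().sum())  # missing values per column
print(demo.dtypes)          # columns containing NaN become float64
```

A side effect worth knowing: any column that contains at least one NaN is stored as `float64`, which is why integer-looking codes are printed with a decimal point (for example `3.0`) in the outputs.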

In [22]:
df.head()

,Age,Shape,Margin,Density,Severity
0,67.0,3.0,5.0,3.0,1
1,43.0,1.0,1.0,,1
2,58.0,4.0,5.0,3.0,1
3,28.0,1.0,1.0,3.0,0
4,74.0,1.0,5.0,,1


In [23]:
df.describe()

,Age,Shape,Margin,Density,Severity
count,956.0,930.0,913.0,885.0,961.0
mean,55.487448,2.721505,2.796276,2.910734,0.463059
std,14.480131,1.242792,1.566546,0.380444,0.498893
min,18.0,1.0,1.0,1.0,0.0
25%,45.0,2.0,1.0,3.0,0.0
50%,57.0,3.0,3.0,3.0,0.0
75%,66.0,4.0,4.0,3.0,1.0
max,96.0,4.0,5.0,4.0,1.0


The next step is to inspect the rows that contain **NaN** values and then count the missing values in each column.

In [24]:
# Select every row that has at least one missing value in any column
df.loc[df.isnull().any(axis=1)]

,Age,Shape,Margin,Density,Severity
1,43.0,1.0,1.0,,1
4,74.0,1.0,5.0,,1
5,65.0,1.0,,3.0,0
6,70.0,,,3.0,0
7,42.0,1.0,,3.0,0
...,...,...,...,...,...
778,60.0,,4.0,3.0,0
819,35.0,3.0,,2.0,0
824,40.0,,3.0,4.0,1
884,,4.0,4.0,3.0,1
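
A single row can be missing more than one field (row 6 above lacks both Shape and Margin), so the number of incomplete rows is not the same as the number of missing cells. The sketch below shows the difference on a toy frame; the values in `toy` are made up and only the pattern of gaps matters.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; row 1 has two gaps, row 2 has one.
toy = pd.DataFrame({
    "Age":    [43.0, 70.0, np.nan, 58.0],
    "Shape":  [1.0, np.nan, 4.0, 4.0],
    "Margin": [1.0, np.nan, 4.0, 5.0],
})

row_has_nan = toy.isnull().any(axis=1)     # True where a row has at least one NaN
incomplete_rows = int(row_has_nan.sum())   # rows affected
missing_cells = int(toy.isnull().sum().sum())  # individual NaN cells

print(incomplete_rows, missing_cells)
```

Here two rows are incomplete but three cells are missing. On the real data, the same `isnull().any(axis=1).sum()` expression applied to `df` would give the number of rows that dropping incomplete records would remove.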


In [25]:
null_count = df.isnull().sum()
null_count

Age          5
Shape       31
Margin      48
Density     76
Severity     0
dtype: int64
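
Raw counts are easier to judge as a share of the dataset. The sketch below converts the counts above into percentages, assuming 961 rows in total (the Severity count reported by `describe()`, a column with no missing values). The name `null_pct` is only for this example.

```python
import pandas as pd

# Counts copied from the output above.
null_count = pd.Series(
    {"Age": 5, "Shape": 31, "Margin": 48, "Density": 76, "Severity": 0}
)
n_rows = 961  # assumption: total rows, taken from the Severity count in describe()

# Percentage of missing values per column, rounded to two decimals.
null_pct = (null_count / n_rows * 100).round(2)
print(null_pct)
```

With the DataFrame itself at hand, `df.isnull().mean() * 100` should give the same percentages without hard-coding the counts. Density has the largest share of missing values, at just under 8% of the rows.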