1. What is the Purpose of the Data?
* What is this data about?
* Why was it collected?
* What problem are we trying to solve with it?

2. Where Did the Data Come From?
* What is the source of this data? (Sensors, surveys, databases, etc.)
* How was it collected? (Manual entry, automated scraping, API, etc.)
* Is the data trustworthy and accurate?

3. What Does the Data Look Like?
* What are the key columns (features) in the dataset?
* What type of data is in each column? (Numbers, text, dates, categories, etc.)
* How many rows and columns are there?

4. Are There Any Missing or Incorrect Values?
* Are there empty or missing values?
* Are there any obvious mistakes in the data? (Typos, incorrect formats, etc.)
* How should we handle missing or incorrect data?

5. Are There Any Duplicates?
* Are there repeated rows in the dataset?
* Should duplicates be removed or kept?

6. What is the Distribution of the Data?
* What is the range of values in each column? (Min, max, average)
* Are there any extreme values (outliers)?
* Is the data balanced, or do some values appear much more often than others?
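
The balance question can be answered with a frequency table. The following is a minimal sketch, assuming a made-up label column (`labels` and its values are hypothetical, not taken from any real dataset):

```python
import pandas as pd

# Hypothetical binary labels, made up for illustration only
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="label")

# Share of rows that falls into each class
shares = labels.value_counts(normalize=True)

# Ratio of the most common class to the rarest one (1.0 means perfectly balanced)
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(imbalance_ratio)
```

A ratio far above 1 suggests that plain accuracy may be a misleading metric for a model trained on the data.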

7. Are There Relationships Between Features?
* Do some columns depend on others?
* Are any columns strongly correlated?
* Does the data follow any patterns or trends?

8. If There Are Labels (For Supervised Learning), Are They Correct?
* Does the dataset have labels (target values)?
* How were the labels assigned?
* Are there any errors or inconsistencies in the labels?

9. How Often is the Data Updated?
* Is this a static dataset, or will new data keep coming in?
* How frequently does the data change?

10. Are There Any Ethical or Privacy Concerns?
* Does the data contain sensitive or personal information?
* Are there any biases in the data that could affect decision-making?

In [1]:
import pandas as pd

In [2]:
# load the training data into a DataFrame
df = pd.read_csv('train.csv')

### 1) How big is the data?

In [3]:
# returns a tuple: (number of rows, number of columns)
df.shape

(891, 12)

### 2) What does the data look like?

In [4]:
# shows the first 5 rows for a quick look at the columns and their values
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [None]:
# picks 5 random rows, which avoids judging the data only by its first few entries
df.sample(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 98 | 99 | 1 | 2 | Doling, Mrs. John T (Ada Julia Bone) | female | 34.0 | 0 | 1 | 231919 | 23.0 | NaN | S |
| 492 | 493 | 0 | 1 | Molson, Mr. Harry Markland | male | 55.0 | 0 | 0 | 113787 | 30.5 | C30 | S |
| 851 | 852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.775 | NaN | S |
| 883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5 | NaN | S |
| 58 | 59 | 1 | 2 | West, Miss. Constance Mirium | female | 5.0 | 1 | 2 | C.A. 34651 | 27.75 | NaN | S |


### 3) What are the data types of the columns?

In [6]:
# lists each column's dtype and non-null count, plus the frame's memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
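
The `object` columns above hold text, so it helps to know how many distinct values each one has before deciding how to encode it. A minimal sketch, assuming a small made-up frame (`toy` is hypothetical and only borrows a few column names from the output above):

```python
import pandas as pd

# Hypothetical stand-in for the real data, made up for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S"],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# Names of the columns that are not numeric
non_numeric = toy.select_dtypes(exclude=["number"]).columns.tolist()

# Number of distinct values in each of those columns
unique_counts = toy[non_numeric].nunique()

# A column with only a few distinct values can be stored as a categorical
toy["Sex"] = toy["Sex"].astype("category")

print(unique_counts)
print(toy.dtypes)
```

Columns with few distinct values are usually good candidates for the `category` dtype, while near-unique columns such as names or ticket numbers are not.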


### 4) Are there any missing values?

In [8]:
# counts the missing values in each column
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
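
Raw counts are easier to judge as a share of the rows: with 891 rows, Cabin is about 77% missing (687/891), Age about 20% (177/891), and Embarked about 0.2% (2/891). A minimal sketch of that calculation, assuming a small made-up frame (`toy` is hypothetical, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data, made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# isnull() gives True/False per cell; the mean of booleans is the missing share
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

print(missing_pct)
```

How to act on these shares is a judgment call; a mostly empty column is often dropped or reduced to a "was recorded" flag, while a column with a few gaps is usually imputed.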

### 5) What does the data look like statistically?

In [10]:
# summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
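
The quartiles in this summary are enough for a rough outlier check. One common rule of thumb (not something this notebook applies itself) flags values more than 1.5 × IQR beyond the quartiles. For Fare, Q1 = 7.9104 and Q3 = 31.0, so the upper fence is 31.0 + 1.5 × 23.0896 ≈ 65.63, and the maximum of 512.3292 lies far above it. A minimal sketch, where `iqr_outliers` is a hypothetical helper and `fares` is a handful of values picked for illustration:

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# A few fare values chosen for illustration, not the full column
fares = pd.Series([7.25, 7.925, 8.05, 14.4542, 31.0, 53.1, 512.3292])

outliers = iqr_outliers(fares)
print(outliers)
```

A flagged value is not necessarily an error; it only marks a row worth inspecting before deciding whether to keep, cap, or drop it.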


### 6) Are there any duplicate rows?

In [11]:
# counts the rows that are exact copies of an earlier row
df.duplicated().sum()

0
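
`duplicated()` with no arguments only catches rows that match in every column. Rows that share an identifier but differ elsewhere need the `subset` argument. A minimal sketch, assuming a small made-up frame (`toy` is hypothetical, not the real data):

```python
import pandas as pd

# Hypothetical frame: PassengerId 2 appears twice with slightly different fares
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["A", "B", "B", "C"],
    "Fare": [7.25, 71.28, 71.30, 8.05],
})

# Exact copies across every column
n_full = int(toy.duplicated().sum())

# Repeats of the identifier alone
n_key = int(toy.duplicated(subset=["PassengerId"]).sum())

# Keep the first row seen for each identifier
deduped = toy.drop_duplicates(subset=["PassengerId"], keep="first")

print(n_full, n_key, len(deduped))
```

Whether to keep the first or last occurrence, or to keep both, depends on why the repeats exist.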

### 7) How are the columns correlated?

In [14]:
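# pairwise Pearson correlation between the numeric columns; columns with near-zero
# correlation to the target are candidates for removal, but a weak linear
# correlation alone does not prove that a column has no effect on the output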
df.select_dtypes(include=['number']).corr()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |


In [15]:
# correlation of every numeric column with the target column, Survived
df.select_dtypes(include=['number']).corr()['Survived']

PassengerId   -0.005007
Survived       1.000000
Pclass        -0.338481
Age           -0.077221
SibSp         -0.035322
Parch          0.081629
Fare           0.257307
Name: Survived, dtype: float64
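
The sign of a correlation matters less than its size when ranking columns, so it can help to sort by absolute value and drop the target's trivial correlation with itself. A minimal sketch that rebuilds the printed values by hand (the name `corr_with_target` is an assumption for illustration):

```python
import pandas as pd

# Values copied from the output above
corr_with_target = pd.Series({
    "PassengerId": -0.005007,
    "Survived": 1.000000,
    "Pclass": -0.338481,
    "Age": -0.077221,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
}, name="Survived")

# Drop the target itself, then order by strength while keeping the sign
features = corr_with_target.drop("Survived")
order = features.abs().sort_values(ascending=False).index
ranked = features.reindex(order)

print(ranked)
```

Pclass and Fare come out on top, while PassengerId is close to zero, as expected for an arbitrary row identifier. Pearson correlation only measures linear association, and the non-numeric columns such as Sex are not covered by this table at all.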