# Basic Data Exploration

### What is Basic Data Exploration?

Before we analyze, visualize, or build models with our data, we must first understand what we’re working with. That’s where **basic data exploration** comes in. This is our very first step in any data science or machine learning project. When we open a dataset like Titanic, we want to quickly check how many rows and columns it has, what kinds of values are stored in each column, whether the data looks clean, and if there are any surprises — like missing values, incorrect types, or weird outliers. We don’t need to read the entire file manually; instead, we use built-in Pandas tools to get a strong first impression.

Using simple and powerful tools like `.head()`, `.tail()`, `.info()`, `.describe()`, and `.shape`, we can uncover the structure, quality, and summary statistics of our dataset within seconds. The first four are methods that we call with parentheses; `.shape` is an attribute that we read without them. Together they form the backbone of every Exploratory Data Analysis (EDA) workflow. They help us decide what to clean, what to visualize, and what to use in our models. We’re not trying to draw conclusions at this point, just trying to **understand** the data so we can make the right choices moving forward. Every good AI/ML workflow starts with this step, and the stronger our grasp here, the better our results later.

### `.head()` and `.tail()`: Previewing Our Dataset

When we load a new dataset, we usually don’t want to print the entire thing. Instead, we use `.head()` to preview the **first few rows**. By default, it shows the first 5 rows, which is enough to check whether our data loaded correctly. We can confirm that column names are recognized, that data values look realistic, and that nothing broke during the loading process. For example, in the Titanic dataset, we can see how passenger information is stored, including names, age, ticket class, fare, and more.

Similarly, `.tail()` shows us the **last few rows** of the DataFrame, also 5 by default. This helps us verify the end of the file, check for empty rows, or look at recently added records. Together, `.head()` and `.tail()` show us both ends of the data. They say nothing about the rows in between, so we treat them as a quick sanity check rather than a full review.

In [2]:
import pandas as pd

In [None]:
df = pd.read_csv("file_name.csv") 

print(df.head())   # View first 5 rows
print(df.tail(3))  # View last 3 rows
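
To try both methods without a file on disk, here is a minimal sketch on a small made-up DataFrame. The name `toy`, its columns, and its values are assumptions for illustration only and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data standing in for a real CSV file.
toy = pd.DataFrame({
    "passenger_id": range(1, 11),
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
})

first_rows = toy.head()           # no argument: first 5 rows
last_rows = toy.tail(3)           # last 3 rows
all_but_last_two = toy.head(-2)   # negative n: every row except the last 2

print(first_rows)
print(last_rows)
print(len(all_but_last_two))
```

A negative argument can be handy when we want everything except a few trailing rows, for example a footer that slipped into the file.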

### `.shape`: Checking the Size of Our Dataset

The `.shape` attribute gives us the number of **rows and columns** in our DataFrame as a tuple, `(rows, columns)`. Because it is an attribute and not a method, we write it without parentheses. This is important because it tells us how big our dataset is, whether it has a few records or thousands. In the Titanic dataset, we get `(891, 12)`, which means we have 891 rows and 12 columns.

Knowing the shape helps us plan ahead. If we’re working with 100,000+ rows, we might want to sample. If we only have 10 rows, we might manually review them. Later, when we clean data or drop columns, checking `.shape` confirms that changes happened correctly.


In [None]:
df.shape
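
The checks described above might look like the following sketch. The DataFrame `toy` and its column names are made up for illustration; `drop` and `sample` are standard pandas methods, used here only to show how the shape changes.

```python
import pandas as pd

# Hypothetical toy data: 4 rows, 3 columns.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": ["w", "x", "y", "z"],
})

n_rows, n_cols = toy.shape            # shape is a (rows, columns) tuple
print(n_rows, n_cols)

trimmed = toy.drop(columns=["c"])     # returns a new DataFrame; toy is unchanged
print(trimmed.shape)                  # one column fewer, same number of rows

sampled = toy.sample(n=2, random_state=0)  # random subset of 2 rows
print(sampled.shape)                  # fewer rows, same number of columns
```

Comparing the shape before and after an operation is a cheap way to confirm that it removed what we intended and nothing more.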

### `.info()`: Getting the Structure and Data Types

The `.info()` method is one of the most useful tools when we first load a dataset. It gives us a detailed overview of **each column**: the name, the data type (`int64`, `float64`, `object`, etc.), and how many non-null values are present. It also reports the total number of rows and the approximate memory usage. This helps us quickly spot **missing values** and understand whether our data types are what we expect.

For instance, if the "Age" column has 714 non-null values but our dataset has 891 rows, we know that 891 − 714 = 177 ages are missing. Also, if we find a column that should be numeric but is marked as `object`, it usually means some values are stored as text and we need to clean or convert it. Understanding the structure early helps us avoid bugs and errors later.

In [None]:
df.info()
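
As a complement to reading the `.info()` output by eye, the sketch below counts missing values per column and converts a text column to numbers. The DataFrame `toy` and its values are hypothetical; `isna().sum()` and `pd.to_numeric` are one common way to do this, not the only one.

```python
import pandas as pd

# Hypothetical toy data: one missing age, and fares stored as text.
toy = pd.DataFrame({
    "age": [22.0, None, 26.0, 35.0],
    "fare": ["7.25", "71.28", "n/a", "53.10"],
})

toy.info()  # prints its summary directly and returns None

missing_per_column = toy.isna().sum()   # number of missing values in each column
print(missing_per_column)

# errors="coerce" turns entries that cannot be parsed (here "n/a") into NaN.
toy["fare"] = pd.to_numeric(toy["fare"], errors="coerce")
print(toy.dtypes)
print(toy["fare"].isna().sum())
```

Note that the text `"n/a"` is not counted as missing until the column is converted, which is one reason to check data types and missing values together.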

### `.describe()`: Summarizing Numerical Columns

To quickly explore all numeric columns, we use `.describe()`. It gives us useful statistics like the **count** of non-null values, **mean**, **standard deviation**, **minimum and maximum**, and the 25th, 50th (median), and 75th percentiles. These numbers hint at whether a feature is skewed, whether it may have outliers, and how its values are distributed.

For example, in the Titanic dataset, we can use `.describe()` to understand the distribution of fares. If the max fare is much higher than the 75th percentile, that could indicate outliers. The `count` row also shows how many non-null values each column has. These summaries are critical when we’re preparing features for machine learning, scaling values, or detecting anomalies.

In [None]:
df.describe()
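
The comparison between the maximum and the 75th percentile can be made concrete with the sketch below. The fare values are invented, and the 1.5 × IQR fence is a common rule of thumb that we apply ourselves; it is not something `.describe()` computes.

```python
import pandas as pd

# Hypothetical fares with one very large value.
fares = pd.DataFrame(
    {"fare": [7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 31.0, 500.0]}
)

summary = fares["fare"].describe()   # a Series indexed by statistic name
print(summary)

q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1                        # interquartile range
upper_fence = q3 + 1.5 * iqr         # rule-of-thumb threshold (assumption)

flagged = fares[fares["fare"] > upper_fence]
print(flagged)
```

In this toy data the mean is also far above the median, which is another hint of a right-skewed column. A flagged value is only a candidate for review, not proof of an error.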

We can also summarize **non-numeric (object) columns** like names or categories using:

In [None]:
df.describe(include='object')
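
For text columns, the summary reports `count`, `unique`, `top`, and `freq` instead of means and percentiles. The sketch below shows where those numbers come from. The DataFrame `toy` and its values are made up, and the explicit `dtype="object"` is an assumption of this sketch so that the columns are selected by `include="object"` however a given pandas version infers text columns.

```python
import pandas as pd

# Hypothetical toy data with two text columns, one of them with a missing value.
toy = pd.DataFrame({
    "sex": pd.Series(["male", "female", "male", "male"], dtype="object"),
    "embarked": pd.Series(["S", "C", "S", None], dtype="object"),
})

text_summary = toy.describe(include="object")
print(text_summary)

# The same figures computed directly for one column.
print(toy["embarked"].count())          # non-null entries
print(toy["embarked"].nunique())        # distinct non-null values
print(toy["embarked"].value_counts())   # how often each value appears
```

Here `top` is the most frequent value and `freq` is how many times it occurs, so `value_counts()` is a natural next step when a column looks interesting.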

### Exercises

Q1. Load the Titanic dataset and use `.head()` to display the first 10 rows.

In [3]:
data = pd.read_csv('data/train.csv')

print(data.head(10))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
5                                   Moran, Mr. James    male   NaN      0   
6                            McCarthy, Mr. Timothy J    male  54

Q2. Use `.tail()` to display the last 7 rows.

In [4]:
print(data.tail(7))

     PassengerId  Survived  Pclass                                      Name  \
884          885         0       3                    Sutehall, Mr. Henry Jr   
885          886         0       3      Rice, Mrs. William (Margaret Norton)   
886          887         0       2                     Montvila, Rev. Juozas   
887          888         1       1              Graham, Miss. Margaret Edith   
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex   Age  SibSp  Parch           Ticket    Fare Cabin Embarked  
884    male  25.0      0      0  SOTON/OQ 392076   7.050   NaN        S  
885  female  39.0      0      5           382652  29.125   NaN        Q  
886    male  27.0      0      0           211536  13.000   NaN        S  
887  female  19.0      0      0           112053  30.000   B42 

Q3. Use `.shape` to find out how many rows and columns the dataset has.

In [5]:
print(data.shape)

(891, 12)


Q4. Use `.info()` to check which columns contain missing values.

In [6]:
data.info()  # info() prints its report directly; wrapping it in print() would also print its return value, None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Q5. Use `.describe()` to summarize the numerical columns.

In [7]:
print(data.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


Q6. Use `.describe(include='object')` to explore non-numeric columns.

In [8]:
print(data.describe(include='object'))

                           Name   Sex  Ticket Cabin Embarked
count                       891   891     891   204      889
unique                      891     2     681   147        3
top     Braund, Mr. Owen Harris  male  347082    G6        S
freq                          1   577       7     4      644


### Summary

When we begin working with any dataset, whether it's for learning, analysis, or building AI models, we always start with basic data exploration. This stage helps us understand what we’re dealing with. The `.head()` and `.tail()` methods give us a quick look at the start and end of the data. The `.shape` attribute tells us the dataset’s size. The `.info()` method helps us see what kind of data we have and whether anything is missing. The `.describe()` method summarizes the numeric columns, or the text columns when we pass `include='object'`, so we can spot trends, outliers, or unusual distributions.

Together, these tools prepare us for deeper analysis. We start to see patterns, detect problems, and figure out what needs to be cleaned or transformed. As we move toward visualization, modeling, and prediction, this foundation will guide our decisions. If we skip this step, we risk building models on broken or misunderstood data. But when we explore our data carefully, everything else becomes easier and more reliable.