# **Lab1: Introduction to Data Analysis**

**Course**: **INF-604: Data Analysis** <br>
**Lecturer**: **Sothea HAS, PhD**

-----

**Objective:**  You have already seen some elements of Data Analysis in the course. In this lab, we will take our first step into working with the main element of Data Analysis, which is the dataset. By the end of this lab, you will be able to import data into a Jupyter Notebook and perform some data manipulation.

- The `notebook` of this `Lab` can be downloaded here: [Lab1_Introduction.ipynb](https://hassothea.github.io/Data_Analysis_AUPP/Labs/Lab1_Introduction.ipynb).

- Or you can work directly with `Google Colab` here: [Lab1_Introduction.ipynb](https://colab.research.google.com/drive/14L1fgW35_yZAW3BIsG-oGLxBO0lXANMO?usp=sharing).

-----


- `Student's name:` Mork Sunhout
- `Year:` Junior
- `Major:` Software Development

-----

## **1. Data for Your Business**

Imagine you want to start your own business, such as a coffee shop or a bookstore. What types of data do you think you need to gather to determine the potential success of your business? Here are some questions to help you think and answer this question:

- What is your plan for the business?

- What information might you need to collect? What is the size of the data?

- Where do you think you can find this information?

- What might go wrong with the collected data?

- In which step of the Data Analysis process do we handle such problems?

`Answer:`

1. The plan is to open a coffee shop in Phnom Penh that primarily targets university students and young professionals. The shop will offer affordable drinks, free Wi-Fi, and a cozy environment designed for studying or working. To stand out from competitors, the business will provide unique menu items and special discounts for students.

2. To see if the coffee shop can succeed, we need data on customers (age, income, lifestyle, student or worker), the market (number of coffee shops, prices, reviews, busy hours), and the location (foot traffic, nearby schools or offices, rent). We also need to know how often people buy coffee, how much they spend, and what drinks they like. The data will be small to medium in size but can grow with social media reviews, sales records, and customer loyalty programs.

3. We can find this information from three main sources. Primary data comes from surveys, interviews, focus groups, or watching customer traffic. Secondary data includes government reports, market studies, online reviews, and competitors’ websites or menus. Digital sources like social media and Google Trends can also show customer opinions and market trends.

4. The data collected may have problems. Some surveys might be incomplete, or responses may be biased if only certain groups answer. Reports could be outdated, and customer answers may not match their real behavior. There is also a risk of collecting too much unnecessary data, which can make analysis harder.

5. These problems are handled during the data collection and cleaning stages. Data collection makes sure the information is reliable, while data cleaning fixes issues like missing or repeated data. Doing this well helps ensure the analysis is accurate and useful for business decisions.
---------


## **2. Importing Some Data**


There are many online data sources that you can explore, and one of the most popular is [`Kaggle`](https://www.kaggle.com/datasets/). In addition to datasets, `Kaggle` also hosts data competitions with prizes and offers courses to help you advance in data learning.


Here, we start our journey by exploring a dataset whose name you have probably heard before: [`Titanic`](https://www.kaggle.com/datasets/mahmoudsaadmohamed/titanic-dataset). You can download it from `Kaggle` using the following code, which fetches the `yasserh/titanic-dataset` copy.

In [None]:
# %pip install kagglehub

import kagglehub

# Download latest version
path = kagglehub.dataset_download("yasserh/titanic-dataset")


# Pandas module allows you to import the data
import pandas as pd
data = pd.read_csv(path+'/Titanic-Dataset.csv')
data.head(10)

Downloading from https://www.kaggle.com/api/v1/datasets/download/yasserh/titanic-dataset?dataset_version_number=1...


100%|██████████| 22.0k/22.0k [00:00<00:00, 12.7MB/s]

Extracting files...





| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
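
The import step can also be tried without a Kaggle download. The following is a minimal sketch, assuming a tiny hand-written CSV (the file name `mini_titanic.csv` and its three rows are made up for illustration). It builds the file path with `pathlib` instead of string concatenation and then reads it with `pd.read_csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature CSV standing in for Titanic-Dataset.csv
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

with tempfile.TemporaryDirectory() as folder:
    # Path(...) / "name" joins folder and file name in a portable way
    csv_path = Path(folder) / "mini_titanic.csv"
    csv_path.write_text(csv_text)
    mini = pd.read_csv(csv_path)

print(mini.shape)    # (number of rows, number of columns)
print(mini.head(2))  # first two rows only
```

The empty `Age` field in the last row is read as a missing value (`NaN`), which is the same mechanism that produces the `NaN` entries in the Titanic data.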


-------

### **2.1. Overview of the data**


Answer the following questions:

**A.** How many rows and columns are there in this dataset?

**B.** Explain the meaning of each column.

**C.** Are there any missing values in this dataset? If so, how many rows contain at least one missing value?

- What should you do with column `Cabin`?

- How would you drop rows with at least one missing value?


------------------

In [None]:
# To do

# Question A

data.shape


(891, 12)

In [None]:
# Question B

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
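
`data.info()` lists the column names and types but not what they mean. One way to answer Question B is to keep a small data dictionary next to the data. The sketch below assumes the column meanings as they are commonly documented for the Kaggle Titanic data; they are not defined in this notebook, so treat them as an assumption to check against the dataset page.

```python
import pandas as pd

# Column meanings as commonly documented for the Kaggle Titanic data
# (an assumption: this notebook itself does not define them).
column_meaning = {
    "PassengerId": "Row identifier for the passenger",
    "Survived": "Survival flag (0 = did not survive, 1 = survived)",
    "Pclass": "Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)",
    "Name": "Passenger name",
    "Sex": "Sex of the passenger",
    "Age": "Age in years",
    "SibSp": "Number of siblings or spouses aboard",
    "Parch": "Number of parents or children aboard",
    "Ticket": "Ticket number",
    "Fare": "Passenger fare",
    "Cabin": "Cabin number",
    "Embarked": "Port of embarkation (C, Q or S)",
}

# A Series indexed by column name prints as a readable two-column listing
data_dictionary = pd.Series(column_meaning, name="meaning")
print(data_dictionary.to_string())
```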


In [None]:
# C
# Yes. Age, Cabin and Embarked contain missing values (see the non-null counts from data.info()).
# The expression below gives the fraction of missing values in each column.
data.isnull().sum()/data.shape[0]


| | 0 |
|---|---|
| PassengerId | 0.0 |
| Survived | 0.0 |
| Pclass | 0.0 |
| Name | 0.0 |
| Sex | 0.0 |
| Age | 0.198653 |
| SibSp | 0.0 |
| Parch | 0.0 |
| Ticket | 0.0 |
| Fare | 0.0 |
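
Question C asks for the number of rows with at least one missing value, which is a per-row count rather than a per-column fraction. A minimal sketch on a made-up four-row table (`toy` is hypothetical, not the Titanic data) shows one way to get it: `isnull().any(axis=1)` flags each row that has a missing entry, and summing the flags counts those rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with a few missing entries
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

# Missing values per column
missing_per_column = toy.isnull().sum()

# True for every row that has at least one missing value, then count the Trues
rows_with_missing = toy.isnull().any(axis=1).sum()

# Cross-check: rows that survive dropna() are the complete ones
complete_rows = len(toy.dropna())

print(missing_per_column)
print(rows_with_missing, complete_rows)
```

Applied to the lab data, the same expression would be `data.isnull().any(axis=1).sum()`, run before any column or row is dropped.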


In [None]:
# Drop Cabin but do not modify the original data

# data_no_cabin = data.drop(columns=['Cabin'])

In [None]:
# drop it from the dataset
data.drop(columns=['Cabin'],inplace=True)

In [None]:
# drop rows with at least one missing value

data.dropna(inplace=True)
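
Using `inplace=True` overwrites `data`, so every later count is computed on the reduced table. As a hedged alternative, the sketch below (on a hypothetical `toy` table) keeps the original untouched and stores the cleaned result under a new name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 7.93],
})

# drop() and dropna() return new DataFrames when inplace is not set,
# so the calls can be chained and `toy` itself is left unchanged
cleaned = toy.drop(columns=["Cabin"]).dropna()

print(toy.shape)      # original is still complete
print(cleaned.shape)  # one column and one row fewer
```

Keeping both versions makes it possible to answer questions about all passengers with the original table and questions that need complete rows with the cleaned one.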

-------

### **2.2. Single information**


**D.** How many male and female passengers were on the ship?

**E.** How many of them survived? How many didn't?

**F.** How many passengers were younger than 3 years old? How many were older than 60 years old?

**G.** How many passengers embarked from the three ports?

- `C`: Cherbourg, France.
- `Q`: Queenstown, Ireland.
- `S`: Southampton, England.

**H.** How many passengers were in the 1st, 2nd and 3rd class?


In [None]:
# To do

# Question D

# Among the 712 passengers left after dropping rows with missing values, there are 453 males and 259 females.

data['Sex'].value_counts()


| Sex | count |
|---|---|
| male | 453 |
| female | 259 |


In [None]:
# Question E

# 288 survived, 424 didn't.

data['Survived'].value_counts()

| Survived | count |
|---|---|
| 0 | 424 |
| 1 | 288 |


In [None]:
# Question F

# There were 24 passengers younger than 3 years old.

data.query('Age < 3').shape

(24, 11)

In [None]:
# Question F

# There were 21 passengers older than 60 years old.

data.query('Age > 60').shape



(21, 11)
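
`data.query(...).shape` returns both the row and the column count, while the question only needs the number of rows. The sketch below, on a hypothetical `toy` table, shows that `query` and a boolean mask select the same rows, and that `len(...)` or summing the mask gives the count directly.

```python
import pandas as pd

# Hypothetical toy table of ages
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [2.0, 0.83, 35.0, 62.0, 71.0],
})

# Two equivalent ways to filter rows
under_3_query = toy.query("Age < 3")
under_3_mask = toy[toy["Age"] < 3]

# Number of matching rows
n_under_3 = len(under_3_query)

# A boolean Series sums to the number of True values
n_over_60 = int((toy["Age"] > 60).sum())

print(n_under_3, n_over_60)
```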

In [None]:
# Question G

# 554 passengers embarked from Southampton, 130 from Cherbourg, and 28 from Queenstown.

data['Embarked'].value_counts()

| Embarked | count |
|---|---|
| S | 554 |
| C | 130 |
| Q | 28 |


In [None]:
# Question H

# There were 184 passengers in 1st class, 173 in 2nd class, and 355 in 3rd class.

data['Pclass'].value_counts()

| Pclass | count |
|---|---|
| 3 | 355 |
| 1 | 184 |
| 2 | 173 |
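
`value_counts()` orders its result by frequency, which is why the classes above appear as 3, 1, 2. A small sketch on a hypothetical `toy` column shows two optional refinements: `sort_index()` to list the categories in their natural order, and `normalize=True` to get shares instead of counts.

```python
import pandas as pd

# Hypothetical toy column of ticket classes
toy = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

# Counts, ordered by class label rather than by frequency
counts = toy["Pclass"].value_counts().sort_index()

# Shares of each class (they sum to 1)
shares = toy["Pclass"].value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```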


-------------

### **2.3. Multiple information**

**I.** How many 1st class passengers survived? How about 2nd and 3rd class?

**J.** How many female passengers survived? How many males did?

**K.** How many people from each embarkation port survived?

**L.** Was `Jack` on the ship? How about `Rose`?

In [None]:
# To do

# Question I

# 120 passengers from 1st class survived, 83 from 2nd class, and 85 from 3rd class.

data[data['Survived'] == 1]['Pclass'].value_counts()


| Pclass | count |
|---|---|
| 1 | 120 |
| 3 | 85 |
| 2 | 83 |


In [None]:
# Question J

# A total of 195 female passengers and 93 male passengers survived.

data[data['Survived'] == 1]['Sex'].value_counts()

| Sex | count |
|---|---|
| female | 195 |
| male | 93 |


In [None]:
# Question K

# Among the survivors, 201 passengers embarked from Southampton, 79 from Cherbourg, and 8 from Queenstown.

data[data['Survived'] == 1]['Embarked'].value_counts()

| Embarked | count |
|---|---|
| S | 201 |
| C | 79 |
| Q | 8 |
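
Questions I to K all combine `Survived` with one other column. Besides filtering and then counting, the same information can be obtained in one step with `groupby` or `pd.crosstab`. The following is a minimal sketch on a hypothetical six-row `toy` table, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
    "Survived": [1, 0, 1, 0, 1, 1],
})

# Survived is coded 0/1, so summing it within each group counts the survivors
survivors_by_class = toy.groupby("Pclass")["Survived"].sum()

# Cross-tabulation: rows are Sex, columns are Survived (0 and 1)
table = pd.crosstab(toy["Sex"], toy["Survived"])

# The mean of a 0/1 column within a group is the survival rate of that group
survival_rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(survivors_by_class)
print(table)
print(survival_rate_by_sex)
```

The cross-tabulation has the advantage of showing survivors and non-survivors side by side, so the group sizes stay visible.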


In [None]:
# Question L

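# Jack and Rose were not on the ship. The two rows shown below match "Rose" only as part of "Ambrose" and "Rosen".
# Only the last expression in a cell is displayed, so the result of the "Jack" search is not shown.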

(data[data['Name'].str.contains("Jack", case=False)])
(data[data['Name'].str.contains("Rose", case=False)])



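| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 72 | 73 | 0 | 2 | Hood, Mr. Ambrose Jr | male | 21.0 | 0 | 0 | S.O.C. 14879 | 73.5 | S |
| 855 | 856 | 1 | 3 | Aks, Mrs. Sam (Leah Rosen) | female | 18.0 | 0 | 1 | 392091 | 9.35 | S |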
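
A plain substring search for "Rose" also matches names such as "Ambrose" and "Rosen". One way to restrict the search to the whole word is a regular expression with word boundaries (`\b`). The sketch below uses the two names from the output above plus one made-up name, `"Doe, Miss. Rose"`, which is hypothetical and added only so that the whole-word search has something to find.

```python
import pandas as pd

names = pd.Series([
    "Hood, Mr. Ambrose Jr",
    "Aks, Mrs. Sam (Leah Rosen)",
    "Doe, Miss. Rose",  # hypothetical name, not in the dataset
])

# Substring search: matches "Ambrose" and "Rosen" as well
substring_hits = names[names.str.contains("Rose", case=False)]

# Whole-word search: \b marks a word boundary on both sides of "Rose"
whole_word_hits = names[names.str.contains(r"\bRose\b", case=False, regex=True)]

print(substring_hits.tolist())
print(whole_word_hits.tolist())
```

When a cell contains several searches, wrapping each one in `print(...)` makes every result visible instead of only the last.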


# **Further Reading**

- `Pandas` Python library: https://pandas.pydata.org/docs/getting_started/index.html#getting-started

- `10 Minutes to pandas`: https://pandas.pydata.org/docs/user_guide/10min.html

- `Some Pandas Lessons`: https://www.kaggle.com/learn/pandas

---------