
# Foundations of Data Science: Practical Assignment

Total Marks: 50

## Task 1: Introduction to Data Science (10 marks)

Data Science Overview (3 marks):
Provide a brief explanation of what data science is and its significance in various industries.

Setting up the Environment (2 marks):
Install necessary Python libraries for data science (e.g., NumPy, Pandas, Matplotlib) using Jupyter Notebook cells.

Loading and Displaying Data (5 marks):
Import a dataset of your choice (e.g., CSV or Excel file) using Pandas.
Display the first few rows of the dataset to understand its structure.

## Task 2: Core Concepts and Principles (15 marks)

Data Types and Structures (4 marks):
Create variables representing different data types (integer, float, string).
Explore and demonstrate basic operations on these variables.

Data Manipulation (5 marks):
Apply data manipulation techniques using Pandas.
Perform tasks such as filtering, sorting, and grouping on the loaded dataset.

Feature Engineering (6 marks):
Create a new feature based on existing features in the dataset.
Explain the rationale behind the feature engineering process.

## Task 3: Statistical Analysis (10 marks)

Descriptive Statistics (4 marks):
Compute and interpret key descriptive statistics (mean, median, standard deviation) for relevant columns in the dataset.

Inferential Statistics (6 marks):
Formulate a hypothesis related to the dataset.
Conduct a statistical test (e.g., t-test) to test the hypothesis.
Provide conclusions based on the test results.

## Task 4: Exploratory Data Analysis (15 marks)

Data Visualization (8 marks):
Create at least three visualizations (e.g., histograms, scatter plots, box plots) to explore relationships within the dataset.
Include appropriate labels and titles for clarity.

Outlier Detection (5 marks):
Identify and handle outliers in the dataset.
Explain the methodology and reasoning behind your approach.

Correlation Analysis (2 marks):
Compute and interpret the correlation between two relevant variables in the dataset.

## Submission Instructions (2 marks)

**Jupyter Notebook Submission (2 marks):**

Compile all tasks in a Jupyter Notebook.
Include comments and explanations for each step.
Submit the notebook file or provide a GitHub link for review.
Upload your submission to the Assignment 1 folder in the assignments area.



# Task One: Introduction to Data Science

## Data Science Overview

Data Science is an interdisciplinary field that utilizes scientific methods, processes, algorithms, and systems to extract insights and knowledge from structured and unstructured data.

Data Science is useful in almost any field because it enables organizations to use data to make better decisions, cut costs, and innovate. It is particularly valuable in healthcare, banking, and retail, but it can add value in any industry by turning raw data into information and knowledge (see the DIKW pyramid).

## Setting up the Environment

In [5]:
# importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for visualization
import seaborn as sns # for visualization
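
The task also asks for the libraries to be installed from a notebook cell. A minimal sketch, assuming the four packages imported above are the only ones needed: it checks which of them are missing and prints the `%pip install` line to run in a Jupyter cell. The helper name `missing_packages` is made up for this example.

```python
import importlib.util

# Packages this notebook imports; the list itself is an assumption.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    # In a Jupyter cell the install itself is done with the %pip magic.
    print("Run in a cell: %pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```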

## Loading and Displaying Data

In [7]:
df = pd.read_csv('rossman_train.csv', engine='python') # loading the file; engine='python' avoids the low_memory (mixed data types) warning from the default parser

In [8]:
df.head(5) # display the first 5 rows of the dataset

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1
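
The low-memory warning most likely comes from a column that mixes numbers and letters, which here would be `StateHoliday`. Another way to avoid it, sketched on a tiny made-up sample rather than the real file, is to tell `read_csv` the column type up front with the `dtype` argument:

```python
import io
import pandas as pd

# Tiny stand-in for the real file: StateHoliday mixes 0 with letter codes.
sample_csv = io.StringIO(
    "Store,Sales,StateHoliday\n"
    "1,5263,0\n"
    "2,0,a\n"
    "3,6064,0\n"
)

# Declaring the column as a string removes any type guessing for it.
sample = pd.read_csv(sample_csv, dtype={"StateHoliday": str})
print(sample.dtypes)
```

With this approach every value in `StateHoliday` is read as text, so `0` becomes the string `'0'`.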


In [9]:
pd.set_option('display.float_format', str) # to avoid scientific notation
df.describe() # statistical summary

Unnamed: 0,Store,DayOfWeek,Sales,Customers,Open,Promo,SchoolHoliday
count,1017209.0,1017209.0,1017209.0,1017209.0,1017209.0,1017209.0,1017209.0
mean,558.4297268309659,3.998340557348588,5773.818972305593,633.1459464082602,0.8301066939045958,0.3815145166824124,0.1786466694651738
std,321.90865114345405,1.997390964946012,3849.926175241448,464.4117338873005,0.3755392246932824,0.4857586048745671,0.3830563681782103
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,280.0,2.0,3727.0,405.0,1.0,0.0,0.0
50%,558.0,4.0,5744.0,609.0,1.0,0.0,0.0
75%,838.0,6.0,7856.0,837.0,1.0,1.0,0.0
max,1115.0,7.0,41551.0,7388.0,1.0,1.0,1.0


# Task Two: Core Concepts and Principles

## Data Types and Structures

In [11]:
# Create variables representing different data types (integer, float, string). Explore and demonstrate basic operations on these variables.

name = 'Sherlock Holmes'
street = 'Baker Street'
street_number = '221'
age = 42
height = 1.96

In [12]:
name * 3 # 'name' is a string, so the * operator repeats it instead of doing arithmetic

'Sherlock HolmesSherlock HolmesSherlock Holmes'

In [13]:
street_number * 10 # street_number is a string, so this repeats it ten times instead of appending a zero

'221221221221221221221221221221'

In [14]:
street_number + '0' # this is how you add a zero at the end: the + operator joins two strings

'2210'

In [15]:
street_number = street_number + 'b' # this is how we modify the variable
print(street_number)

221b


In [16]:
name.upper() # string in all capitals

'SHERLOCK HOLMES'

In [17]:
address = street + ' ' + street_number # when combining strings it can be useful to add blank spaces between words

message = 'User {} is {} years old and {} meters tall. User {} lives at {}.' # combining different data types in one message

message.format(name, age, height, name, address) # add the variable names in the correct order

'User Sherlock Holmes is 42 years old and 1.96 meters tall. User Sherlock Holmes lives at Baker Street 221b.'
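
The same message can also be built with an f-string, which puts each variable directly where it appears in the text. This is only an alternative sketch; the variables are redefined here so the cell runs on its own.

```python
name = 'Sherlock Holmes'
age = 42
height = 1.96
address = 'Baker Street 221b'

# An f-string inserts each variable at the position of its {placeholder}.
message = f'User {name} is {age} years old and {height} meters tall. User {name} lives at {address}.'
print(message)
```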

Below I am playing with string values in an input, trying to change everyone's name to Sherlock Holmes.

In [19]:
answer = input('What is your name? ').strip() # keep the original spelling so it can be used in the reply
if answer.lower() == 'sherlock holmes': # compare in lower case so capitalization does not matter
    print('Nice to meet you, Sherlock Holmes!')
else:
    answer2 = input('Your name is not Sherlock Holmes, would you be willing to change it? ').strip().lower()
    if answer2 == 'yes':
        print('Awesome!!! Nice to meet you, Sherlock Holmes!')
    else:
        print('Well, it was worth a try! Nice to meet you, ' + answer + '!')

What is your name?  Grumpy
Your name is not Sherlock Holmes, would you be willing to change it?  yes


Awesome!!! Nice to meet you, Sherlock Holmes!


## Data Manipulation

In [21]:
# filtering, sorting, and grouping

# selecting the columns DayOfWeek, Sales, Customers, Open, Promo, StateHoliday, SchoolHoliday

df = df[['DayOfWeek', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']].copy() # keep only the columns I need; .copy() makes an independent DataFrame that is safe to modify later
df.head(5)

Unnamed: 0,DayOfWeek,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,5,5263,555,1,1,0,1
1,5,6064,625,1,1,0,1
2,5,8314,821,1,1,0,1
3,5,13995,1498,1,1,0,1
4,5,4822,559,1,1,0,1


In [22]:
# filtering out the value 0 for column Open (I want df to only show me data for when the stores are open)
# general pattern: df[df['column name'] == value]
# Open is stored as an integer (int64), so it must be compared with the number 1; comparing with the string "1" matches no rows

open_df = df[df.Open == 1]
open_df.head(5)

Unnamed: 0,DayOfWeek,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,5,5263,555,1,1,0,1
1,5,6064,625,1,1,0,1
2,5,8314,821,1,1,0,1
3,5,13995,1498,1,1,0,1
4,5,4822,559,1,1,0,1
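
A filter like this depends on the column's data type: comparing an integer column with a string matches no rows. A minimal sketch on a made-up frame (the `toy` data is an assumption, not the Rossmann file) shows the difference:

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up).
toy = pd.DataFrame({"Open": [1, 0, 1], "Sales": [5263, 0, 6064]})

string_match = toy[toy["Open"] == "1"]   # integer column vs. string: no row matches
integer_match = toy[toy["Open"] == 1]    # integer column vs. integer: open stores only

print(len(string_match), len(integer_match))
```

Checking `df.dtypes` before filtering is a quick way to see which kind of value a column holds.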


In [23]:
# sorting

# I want to see the biggest sales first; rows with equal sales are ordered by the number of customers, largest first

df.sort_values(by=["Sales", "Customers"], ascending=False)[["Sales", "Customers"]]

Unnamed: 0,Sales,Customers
44393,41551,1721
132946,38722,5132
101726,38484,5458
87231,38367,5192
424086,38037,1970
...,...,...
1017204,0,0
1017205,0,0
1017206,0,0
1017207,0,0
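
Grouping is the third manipulation named in the task. A minimal sketch, assuming average sales per weekday is the summary of interest, uses `groupby` on a small made-up frame with the dataset's column names:

```python
import pandas as pd

# Toy data with the dataset's column names (values are made up).
toy = pd.DataFrame({
    "DayOfWeek": [1, 1, 2, 2, 2],
    "Sales": [100, 300, 50, 150, 100],
    "Customers": [10, 30, 5, 15, 10],
})

# Average sales and customers for each day of the week.
by_day = toy.groupby("DayOfWeek")[["Sales", "Customers"]].mean()
print(by_day)
```

On the real data the same call would be `df.groupby('DayOfWeek')[['Sales', 'Customers']].mean()`.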


## Feature Engineering

In [25]:
print(df.dtypes)

DayOfWeek         int64
Sales             int64
Customers         int64
Open              int64
Promo             int64
StateHoliday     object
SchoolHoliday     int64
dtype: object


In [26]:
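# convert only the binary flag columns to booleans; replacing 0 and 1 across the whole DataFrame would also change values in DayOfWeek, Sales, and Customers
flag_columns = ['Open', 'Promo', 'SchoolHoliday']
df[flag_columns] = df[flag_columns].astype(bool)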
df.head(5)

Unnamed: 0,DayOfWeek,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,5,5263,555,True,True,0,True
1,5,6064,625,True,True,0,True
2,5,8314,821,True,True,0,True
3,5,13995,1498,True,True,0,True
4,5,4822,559,True,True,0,True


In [55]:
one_hot_encoded = pd.get_dummies(df, columns=['StateHoliday'])
print(one_hot_encoded.columns)

Index(['DayOfWeek', 'Sales', 'Customers', 'Open', 'Promo', 'SchoolHoliday',
       'StateHoliday_0', 'StateHoliday_a', 'StateHoliday_b', 'StateHoliday_c'],
      dtype='object')
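
One-hot encoding turns the categories of `StateHoliday` into separate columns. A new feature can also be derived from existing numeric columns. The sketch below is one possible example and is not part of the original notebook: `SalesPerCustomer` is a made-up column name, and the frame is toy data with the dataset's column names.

```python
import numpy as np
import pandas as pd

# Toy data (made up); the last row mimics a closed day with no customers.
toy = pd.DataFrame({"Sales": [5263, 6064, 0], "Customers": [555, 625, 0]})

# Treat zero customers as missing so the division never divides by zero.
customers = toy["Customers"].replace(0, np.nan)
toy["SalesPerCustomer"] = toy["Sales"] / customers
print(toy)
```

The rationale for such a feature would be that average spend per customer separates busy stores from stores with high-value purchases, which raw sales alone cannot show.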
