# DATA INGESTION AND WRANGLING


## 1) Formats chosen: 

* .jpg/.jpeg
* .mp4
* .xlsx
* .json
* .csv


## 2) Python functions for loading data in different formats 

### 2.1) load_jpeg_image

In [130]:
from PIL import Image
from IPython.display import display

def load_jpeg_image(file_path):
    """
    Load a JPEG image from a file.
    
    :param file_path: Path to the JPEG image file.
    :return: An Image object.
    """
    image = Image.open(file_path)

    #display(image)
    return image

# Example usage:
image = load_jpeg_image('C:/Users/acm11/Pictures/Mønster-prøve.jpg')

# This opens the image in the system's default image viewer:
#image.show()
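
`Image.open` is lazy: it reads the file header and keeps the file open until the pixel data is actually needed. The sketch below shows one possible way to read the image fully and close the file before returning. The helper name `load_jpeg_image_eager` and the throwaway JPEG are hypothetical and exist only for illustration.

```python
import os
import tempfile

from PIL import Image


def load_jpeg_image_eager(file_path):
    """Hypothetical variant of load_jpeg_image that closes the file before returning."""
    # The context manager closes the underlying file when the block ends.
    with Image.open(file_path) as image:
        # copy() forces the pixel data to be read and returns an independent image.
        return image.copy()


# Build a tiny throwaway JPEG so the example does not depend on a local file.
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "sample.jpg")
Image.new("RGB", (4, 3), color=(255, 0, 0)).save(tmp_path, format="JPEG")

loaded = load_jpeg_image_eager(tmp_path)
print(loaded.size, loaded.mode)
```

Note that a copied image no longer carries the `format` attribute of the original file, so read it inside the `with` block if it is needed.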



### 2.2) load_mp4_video

In [16]:
import cv2
import time
import matplotlib.pyplot as plt

def load_mp4_video(file_path):

    # Create a VideoCapture object for reading frames from the video file
    video = cv2.VideoCapture(file_path)

    # List that will hold the decoded frames
    frames = []

    # Start timing
    start_time = time.time()
    
    print("Starting to load frames...")
    

    while True:
        # video.read() decodes the next frame and advances the VideoCapture object's position.
        # It returns 1) a boolean indicating whether a frame was read successfully and 2) the frame itself.
        frame_read_success, frame = video.read()

        if not frame_read_success:
            break
            
        frames.append(frame)

    # Stop timing
    end_time = time.time()
    elapsed_time = end_time - start_time

    # Release the video file (for a live camera or other live source, this stops the capture)
    video.release()

    # Print summary
    print(f"Frames loaded: {len(frames)}")
    print(f"Total time taken: {elapsed_time:.2f} seconds")
    
    return frames


frames = load_mp4_video('C:/Users/acm11/Videos/SodaVideo.mp4')
    

Starting to load frames...
Frames loaded: 329
Total time taken: 18.12 seconds


### 2.3) load_excel_file

In [97]:
import pandas as pd

def load_excel_file(file_path, header_row):
    df = pd.read_excel(file_path, header=header_row)
    return df

# Example usage
excel_data1 = load_excel_file('C:/Users/acm11/OneDrive/Dokumenter/test_excel_familien.xlsx', 0)
excel_data2 = load_excel_file('C:/Users/acm11/OneDrive/Dokumenter/test_excel_familien.xlsx', 1)

#print(excel_data1)
#print(excel_data2)
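
The second argument is passed to the `header` parameter of `pd.read_excel`, which is the 0-indexed row used for column names. The sketch below illustrates the difference between `0` and `1`. To avoid depending on an Excel file, it uses `pd.read_csv` on in-memory text as a stand-in, since `header` has the same meaning there. The sample contents are made up.

```python
import io

import pandas as pd

# Hypothetical sheet contents: a title line followed by the real header row.
raw = "Family overview,,\nname,age,city\nAnna,34,Odense\nBo,36,Aarhus\n"

# header=0 treats the first line as the column names.
df_header0 = pd.read_csv(io.StringIO(raw), header=0)

# header=1 discards the first line and uses the second as the column names.
df_header1 = pd.read_csv(io.StringIO(raw), header=1)

print(list(df_header0.columns))
print(list(df_header1.columns))
```

With `header=0` the title line ends up as the column names and the real header becomes a data row, which is usually a sign that a larger `header` value is needed.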

In [142]:
first_frame_rgb = cv2.cvtColor(frames[0], cv2.COLOR_BGR2RGB)

#plt.imshow(first_frame_rgb)
#plt.title("RGB Image of first frame in soda video")
#plt.axis("off")
#plt.show()


first_frame_image = Image.fromarray(first_frame_rgb)

# Displays the image in another application
#first_frame_image.show(title='First Frame')

# cv2.destroyAllWindows()

# Displaying the image as an array
#display(first_frame_rgb)
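
OpenCV returns frames with the channels in BGR order, while PIL and matplotlib expect RGB, which is why the conversion above is needed. As a small illustration, assuming a 3-channel frame, reversing the last axis with NumPy gives the same channel order. The two-pixel array is a made-up stand-in for a real frame.

```python
import numpy as np

# Hypothetical 1x2 "frame" in BGR order: one pure-blue and one pure-red pixel.
frame_bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels, giving RGB order.
frame_rgb = frame_bgr[..., ::-1]

print(frame_rgb[0, 0])  # the blue pixel, now expressed as RGB
print(frame_rgb[0, 1])  # the red pixel, now expressed as RGB
```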

### 2.4) load_json_file

In [183]:
import json

def load_json_file(file_path):

    with open(file_path, 'r') as file:
        data = json.load(file)

    df = pd.DataFrame(data)

    return df
    
# Example usage
#df = load_json_file('C:/User/SomeDirectory/SomeFile.json')
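
`load_json_file` passes the parsed JSON straight to `pd.DataFrame`, so it works when the file holds a tabular shape such as a list of records. A minimal sketch of that step, using a made-up JSON string in place of a file:

```python
import json

import pandas as pd

# Hypothetical JSON text shaped as a list of records (one object per row).
json_text = (
    '[{"sepalLength": 5.1, "species": "setosa"},'
    ' {"sepalLength": 7.0, "species": "versicolor"}]'
)

# json.loads gives a list of dicts; each dict becomes one DataFrame row.
records = json.loads(json_text)
df = pd.DataFrame(records)

print(df.shape)
print(list(df.columns))
```

Deeply nested JSON would likely need flattening first, for example with `pd.json_normalize`.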


### 2.5) load_csv_file

In [77]:
def load_csv_file(file_path):

    df = pd.read_csv(file_path, header=0)

    return df

healthcare_df = load_csv_file("C:/User/SomeDirectory/SomeFile.csv")

## 3) Using my functions to create Data Frames

* 3.1) XLS -> Data Frame
* 3.2) JSON -> Data Frame
* 3.3) CSV -> Data Frame

### 3.1) XLS -> Data Frame


In [170]:
uspres_df = load_excel_file("C:/Users/acm11/BusinessIntelligencedat4/US_Presidents.xlsx", 0)

### 3.2) JSON -> Data Frame

In [185]:
iris_df = load_json_file("C:/Users/acm11/BusinessIntelligencedat4/iris.json")

### 3.3) CSV -> Data Frame

In [220]:
alcohol_df = load_csv_file("C:/Users/acm11/BusinessIntelligencedat4/Alcohol_effect_on_students.csv")

## 4) Exploration and cleaning of data

### 4.1) Exploring

#### 4.1.1) Exploring US Presidents

##### 4.1.1.1) The first 5 rows: .head()

In [172]:
uspres_df.head()

| | Unnamed: 0 | S.No. | president | prior | party | vice | salary | date updated | date created |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | George Washington | Commander-in-Chief of the Continental Army ... | Nonpartisan | John Adams | 5000 | 2021-07-14 | 2012-03-04 |
| 1 | 1 | 2 | john adams | 1st Vice President of the United States | Federalist | Thomas Jefferson | 10000 | 2021-07-14 | 2012-03-04 |
| 2 | 2 | 3 | Thomas Jefferson | 2nd Vice President of the United States | Democratic- Republican | Aaron Burr | 15000 | 2021-07-14 | 2012-03-04 |
| 3 | 3 | 4 | James Madison | 5th United States Secretary of State (1801–... | Democratic- Republican | George Clinton | 20000 | 2021-07-14 | 2012-03-04 |
| 4 | 4 | 5 | JAMES MONROE | 7th United States Secretary of State (1811–... | Democratic- Republican | Daniel D. Tompkins | 25000 | 2021-07-14 | 2012-03-04 |
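
The first rows already hint at two cleaning candidates: the `Unnamed: 0` column appears to duplicate the row index, and the `president` names use inconsistent capitalization. One possible way to handle both is sketched below on a made-up sample. It is an illustration under those assumptions, not necessarily the cleaning applied later.

```python
import pandas as pd

# Hypothetical sample mirroring the capitalization seen in the first rows.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "president": ["George Washington", "john adams", "JAMES MONROE"],
})

# Drop the leftover index column and normalize the names to title case.
cleaned = sample.drop(columns=["Unnamed: 0"])
cleaned["president"] = cleaned["president"].str.strip().str.title()

print(cleaned["president"].tolist())
```

Title casing is a blunt tool: a name such as "McKinley" would become "Mckinley", so the result should be checked by eye.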


##### 4.1.1.2) Data types: .dtypes

In [174]:
dtypes_series = uspres_df.dtypes
dtypes_df = pd.DataFrame([dtypes_series.index, dtypes_series.values], index=['Column Name', 'Data Type'])
dtypes_df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Column Name | Unnamed: 0 | S.No. | president | prior | party | vice | salary | date updated | date created |
| Data Type | int64 | int64 | object | object | object | object | int64 | datetime64[ns] | datetime64[ns] |
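
The same information can also be laid out with one row per column, which tends to be easier to read for wide tables. A minimal sketch on a made-up frame standing in for `uspres_df`:

```python
import pandas as pd

# Hypothetical frame standing in for uspres_df.
sample = pd.DataFrame({"president": ["George Washington"], "salary": [5000]})

# Name the index, then turn the dtypes Series into a two-column DataFrame.
dtypes_long = sample.dtypes.rename_axis("Column Name").reset_index(name="Data Type")

print(dtypes_long)
```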


##### 4.1.1.3) All missing values (NA or NaN): .isna().sum()

In [176]:
na_values = uspres_df.isna().sum()
type(na_values)

pandas.core.series.Series

In [178]:
na_val_df = pd.DataFrame(na_values)
na_val_df

| | 0 |
|---|---|
| Unnamed: 0 | 0 |
| S.No. | 0 |
| president | 0 |
| prior | 0 |
| party | 0 |
| vice | 0 |
| salary | 0 |
| date updated | 0 |
| date created | 0 |
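
For data sets that do contain gaps, it can help to show the share of missing values next to the raw count. The sketch below uses a small made-up frame. The column names `missing` and `share` are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
sample = pd.DataFrame({
    "sex": ["F", None, "M", "M"],
    "gpa": [70.0, np.nan, np.nan, 65.0],
    "faculty": ["Science", "Arts", "Arts", "Law"],
})

# Count of missing values per column, as a named DataFrame column.
missing = sample.isna().sum().to_frame(name="missing")

# The mean of the boolean mask is the fraction of missing values per column.
missing["share"] = sample.isna().mean()

print(missing)
```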


#### 4.1.2) Exploring Iris data

##### 4.1.2.1) All missing values (NA or NaN): .isna().sum()

In [205]:
iris_df.isna().sum()

sepalLength    0
sepalWidth     0
petalLength    0
petalWidth     0
species        0
dtype: int64

#### 4.1.3) Exploring Alcohol effect on students data

##### 4.1.3.2) All missing values (NA or NaN): .isna().sum()

In [222]:
alcohol_df.isna().sum()

Timestamp                                                                                             0
Your Sex?                                                                                             2
Your Matric (grade 12) Average/ GPA (in %)                                                            7
What year were you in last year (2023) ?                                                             73
What faculty does your degree fall under?                                                             7
Your 2023 academic year average/GPA in % (Ignore if you are 2024 1st year student)                   86
Your Accommodation Status Last Year (2023)                                                           23
Monthly Allowance in 2023                                                                            31
Were you on scholarship/bursary in 2023?                                                              8
Additional amount of studying (in hrs) per week                 