<figure>
  <IMG src="figures/logo-esi-sba.png" WIDTH=300 height="100" ALIGN="right">
</figure>

# Software Engineering for Data Science 
## Final Exam: Fall 2024-Winter 2025
## Exam Duration: 02 Hours

*By Dr. Belkacem KHALDI (b.khaldi@esi-sba.dz)*

## Revised Exam Instruction:

During this exam, you are permitted to refer only to your lecture notes and lab materials. Additionally, you may access the internet only to download required packages, if any. Any breach of this policy will be considered a violation of academic integrity and may result in immediate sanctions, including a grade of zero for the specific exercise or challenge in question. It is imperative to adhere to these guidelines to maintain a fair and equitable assessment environment. Your cooperation is appreciated, and any concerns or questions should be directed to the exam proctor or instructor. Thank you for your understanding and compliance.

## Note: Continuous Evaluation Grade:
Half of the mark from this exam will be incorporated into your final Continuous Evaluation Grade.

## Challenge 1: Git Practice
1. Execute the Git commands required to reach goal (01), illustrated in the following figure:

![Git Challenge 1](figures/Git_goal1.png)

2. Now run the appropriate Git commands to reach goal (02) from goal (01), shown in the following figure:

![Git Challenge 2](figures/Git_goal2.png)

3. Then run the appropriate Git commands to reach goal (03) from goal (02), shown in the following figure:

![Git Challenge 3](figures/Git_goal3.png)

In [183]:
### Your Solution here. All your git commands have to be illustrated in a Markdown cell type.
### Include the final figure of your git log command in the resulting Markdown cell as an image.


## Challenge 2: Data Wrangling, Scraping, and Cleaning for eBooks.com


### Objective:
You have just taken a role as a Data Scientist at eBooks.com publishing company. Your task is to gather and clean data related to specific books from various sources. This includes scraping data for computer science-related books, querying a specific database, as well as reading CSV and JSON files. Subsequently, you will perform necessary data cleaning, data analysis, and data visualization to ensure the quality and usability of the data.


#### Key Points to Note:
- **Comprehensive Data Integration:** 
   You should integrate data from a CSV file, a JSON file, a SQLite database, and web scraping into one unified DataFrame.

- **Data Cleaning:**
    Emphasis should be placed on cleaning the data to ensure consistency, accuracy, and usability.

- **Handling Failures:** 
   You are encouraged to perform data integration to the best of your ability, even if you fail to meet some of the following requirements. The final cleaned CSV file should include data from the successfully ingested sources, and the notebook should clearly document any issues faced with the other sources. Save the final cleaned CSV file as `\resources\MyeBooksComCleaned.csv`; it will be used later in `Challenge 03: Developing a Streamlit Web Application for eBooks.com`.
   

<figure>
  <IMG src="figures/sqllite_db_diagram.png"  ALIGN="right">
</figure>

### Requirements:
#### 1. Ingesting data from csv and json files:
   - **CSV Files**:
      - Read and parse the data from the CSV file (`\resources\ebook_data_visualization.csv`) provided by the company into a dataframe called `df_csv`.
      - Handle inconsistent data formats and standardize them to ensure consistency.
   - **JSON Files**:
      - Read and parse the data from the JSON file (`\resources\ebooks_data_json.json`) provided by the company into a dataframe called `df_json`.
      - Handle inconsistent data formats and standardize them to ensure consistency.
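
As an illustration of the reading step only, the sketch below loads a small CSV and a small JSON sample with pandas and aligns their column names. The sample rows, the inconsistent header spellings, and the helper `normalize_columns` are assumptions made up for this example; the real files may differ.

```python
from io import StringIO

import pandas as pd

# Made-up samples with inconsistent header spellings (illustration only).
csv_text = "Title , price\nSample Title,US$ 12.50\n"
json_text = '[{"title": "Another Title", "PRICE": "€ 9.60"}]'


def normalize_columns(frame):
    """Strip whitespace and title-case column names so sources line up."""
    return frame.rename(columns=lambda name: name.strip().title())


# StringIO stands in for the real file paths in this sketch.
df_csv = normalize_columns(pd.read_csv(StringIO(csv_text)))
df_json = normalize_columns(pd.read_json(StringIO(json_text)))
```

With the real files, the `StringIO(...)` arguments would be replaced by the file paths given above.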

#### 2. Ingesting data from a SQLite Database :
The company has provided you with its SQLite database located at `\resources\ebook_data_warehousing_database.db`, which contains a subset of its data. Refer to the provided database diagram for details about the database tables and their relationships.
   - Craft an appropriate SQL query to retrieve data from the relevant tables in the provided database. 
   - Ingest the resulting data into a dataframe called `df_db`, ensuring the dataframe structure and columns are prepared for later combination with data from other company sources (CSV, JSON, web scraping).
   - Handle inconsistent data formats and standardize them to ensure consistency.
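
The general pattern of querying SQLite into a dataframe might look like the sketch below, which builds a tiny in-memory database and loads a joined result with `pandas.read_sql_query`. The table and column names (`books`, `publishers`, `publisher_id`, and so on) are hypothetical; the real ones must be taken from the database diagram.

```python
import sqlite3

import pandas as pd

# Hypothetical schema for illustration; the real tables are in the diagram.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE publishers (publisher_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        book_id INTEGER PRIMARY KEY,
        title TEXT,
        price TEXT,
        publisher_id INTEGER REFERENCES publishers(publisher_id)
    );
    INSERT INTO publishers VALUES (1, 'Example Press');
    INSERT INTO books VALUES (1, 'Sample Title', 'US$ 12.50', 1);
    """
)

# Alias columns in SQL so they match the names used by the other sources.
query = """
    SELECT b.title AS "Title",
           b.price AS "Price",
           p.name  AS "Publisher"
    FROM books AS b
    JOIN publishers AS p ON p.publisher_id = b.publisher_id
"""
df_db = pd.read_sql_query(query, conn)
conn.close()
```

Aliasing in the query is one possible way to prepare the columns for later combination; renaming in pandas after loading would work equally well.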


#### 3. Ingesting data from Web Scraping:
<figure>
  <IMG src="figures/SCRAPPING.png"  ALIGN="right">
</figure>

   - Use Python along with a web scraping library such as `BeautifulSoup` to scrape book details from the following three (03) locally stored web pages:
      - `\resources\ebook_web_pages\Buy_Computers_Artificial_Intelligence_eBooksOnline_Page1.html`
      - `\resources\ebook_web_pages\Buy_Computers_Artificial_Intelligence_eBooksOnline_Page2.html`
      - `\resources\ebook_web_pages\Buy_Computers_Artificial_Intelligence_eBooksOnline_Page3.html`
   -  Hints for reading local files:
      ``` python
         import os

         # Base directory where the files are located and the file to read.
         # Adjust both values to match your local copy of the exam resources.
         base_path = os.path.join("resources", "ebook_web_pages")
         file_name = "Buy_Computers_Artificial_Intelligence_eBooksOnline_Page1.html"

         # Create the full file path by joining the base path and the file name
         file_path = os.path.join(base_path, file_name)

         # Open the file in read mode with UTF-8 encoding
         with open(file_path, 'r', encoding='utf-8') as file:
            # Read the entire content of the file
            content = file.read()
      ```
   - Scrape the details illustrated in the figure for each book into a dataframe called `df_wscrap` with the following column names: `Title`, `Author`, `Price`, `Publishing Date`, `Book Type`, `Publisher`, and `Description`.
      - Note that `Book Type` has the fixed value `Artificial Intelligence` for all scraped items.
      - Handle inconsistent data formats and standardize them to ensure consistency.
      - You may use the browser developer tools to inspect the HTML components.
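
A minimal sketch of the extraction pattern with `BeautifulSoup` is shown below. The HTML snippet, the tag names, and the CSS classes (`book`, `title`, `author`, `price`) are invented for illustration, as is the helper `text_or_none`; the real pages must be inspected to find the actual structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; inspect the real pages to find the actual tags/classes.
html = """
<div class="book">
  <h2 class="title">Sample Title</h2>
  <span class="author">A. Writer</span>
  <span class="price">US$ 12.50</span>
</div>
<div class="book">
  <h2 class="title">Another Title</h2>
  <span class="price">€ 9.60</span>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(html, "html.parser")
records = [
    {
        "Title": text_or_none(book, ".title"),
        "Author": text_or_none(book, ".author"),
        "Price": text_or_none(book, ".price"),
        "Book Type": "Artificial Intelligence",  # fixed value per the task
    }
    for book in soup.select("div.book")
]
```

Returning `None` for a missing element keeps every record the same shape, so the list can be passed directly to `pandas.DataFrame(records)` and the gaps handled in the cleaning step.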


#### 4. Data Integration, Processing, and Cleaning:
1. **Data Validation, Cleaning, and Integration**:
   - Handle missing data in the dataframes.
   - Remove any duplicate entries.
   - Standardize column names and convert the Publishing Date to one of the formats `%Y-%m`, `%m-%Y`, or `%B %Y`, where:
      - `%Y`: Year with century as a decimal number (e.g., 2025).
      - `%m`: Month as a zero-padded decimal number (e.g., 01 for January, 12 for December).
      - `%B`: Full month name (e.g., January, February).
      - Examples:
         - `%Y-%m`: This format represents a date string where the year comes first, followed by the month (e.g., 2025-01).
         - `%m-%Y`: This format represents a date string where the month comes first, followed by the year (e.g., 01-2025).
         - `%B %Y`: This format represents a date string where the full month name comes first, followed by the year (e.g., January 2025).
   - Convert prices to a single currency (e.g., `US$`) using the specified exchange rate: `1 US$ = 0.96 Euros`.
   - Handle any outlier values in the prices.
   - Combine all dataframes into a single cleaned dataframe and save it to a CSV file called `\resources\MyeBooksComCleaned.csv`.
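
To make the date formats and the exchange rate concrete, here is a small sketch using only the standard library. The list of accepted input formats and the function names `standardize_date` and `price_to_usd` are assumptions for this example; the actual data may need other formats or a different parsing strategy.

```python
from datetime import datetime

# Assumed input formats; extend the tuple after inspecting the real data.
KNOWN_DATE_FORMATS = ("%Y-%m", "%m-%Y", "%B %Y", "%b %Y", "%Y-%m-%d")
EUR_PER_USD = 0.96  # from the task: 1 US$ = 0.96 Euros


def standardize_date(text, out_format="%Y-%m"):
    """Return text re-rendered in out_format, or None if no format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime(out_format)
        except ValueError:
            continue
    return None


def price_to_usd(amount, currency):
    """Convert an amount in 'EUR' or 'USD' to US dollars, rounded to cents."""
    if currency == "EUR":
        return round(amount / EUR_PER_USD, 2)
    if currency == "USD":
        return round(amount, 2)
    raise ValueError(f"Unsupported currency: {currency}")
```

For example, `standardize_date("January 2025")` returns `"2025-01"`, and `price_to_usd(9.60, "EUR")` returns `10.0`, since a price in Euros is divided by 0.96 to obtain US dollars.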


#### 5. Data Analysis and Visualization:
1. Perform basic analysis to find and visualize:
   - The n=5 most common publishers per Book Type, with their average book prices.
   - The n=5 most common authors per Book Type, with their average book prices.
      
2. Perform basic text analysis to find and visualize:
   - The most frequent words as a word cloud from all book descriptions.
   - The most frequent words as a word cloud from book descriptions published by a specific publisher.
   - The most frequent words as a word cloud from book descriptions of a specific book type.

3. Study and plot the probability distribution of book prices by Book Type.
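
One possible reading of "top n by Book Type" is sketched below with a pandas group-by: for each book type, keep the n most frequent values of a column together with their average price. The sample rows and the helper name `top_n_by_type` are made up for illustration, and other interpretations of the requirement are equally valid.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "Book Type": ["AI", "AI", "AI", "Databases", "Databases"],
        "Publisher": ["P1", "P1", "P2", "P3", "P3"],
        "Price": [10.0, 20.0, 30.0, 40.0, 60.0],
    }
)


def top_n_by_type(data, column, n=5):
    """Return, per book type, the n most frequent values of `column`
    with their title counts and average prices."""
    stats = (
        data.groupby(["Book Type", column])["Price"]
        .agg(count="count", avg_price="mean")
        .reset_index()
        .sort_values(["Book Type", "count"], ascending=[True, False])
    )
    # head(n) keeps the first n rows of each group in the sorted order.
    return stats.groupby("Book Type").head(n).reset_index(drop=True)


top_publishers = top_n_by_type(df, "Publisher", n=5)
```

The same helper can be called with `"Author"` as the column, and the resulting table can be passed to any plotting library for the visualization part.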

### Your Solution

#### Note: Additional points will be awarded for code that is well-documented, refactored, and easy to read.

Please provide step-by-step code for each part of the required challenge task.


In [None]:
#Your solutions


## Challenge 03: Developing a Streamlit Web Application for eBooks.com

Using your cleaned CSV file `\resources\MyeBooksComCleaned.csv`, you are tasked with creating a Streamlit web application that includes the following functionalities:

1. **Exploratory Data Analysis (EDA)**:
   - Perform basic EDA on the dataset to understand its structure and key features.

2. **Basic Analysis and Visualization**:
   - Identify and visualize the 5 most common publishers per book type, along with their average book prices.
   - Identify and visualize the 5 most common authors per book type, along with their average book prices.

3. **Text Analysis and Visualization**:
   - Generate and visualize a word cloud of the most frequent words from all book descriptions.
   - Generate and visualize a word cloud of the most frequent words from book descriptions published by a specific publisher.
   - Generate and visualize a word cloud of the most frequent words from book descriptions of a specific book type.

4. **Probability Distribution Analysis**:
   - Analyze and plot the probability distribution of book prices by book type.

5. **Additional Functionalities**:
   - Allow users to upload their own CSV file for analysis.
   - Organize the Streamlit app into separate tabs or pages for each functionality to improve user experience and navigation.



In [None]:
### Your Solution:
### Include screenshots of your Streamlit Web App in Markdown cell type as images.

## Challenge 04: Code Refactoring Techniques
### Part One:
What refactoring techniques can be applied to the following Python code to enhance its readability? Please highlight the main issues and suggest specific refactoring methods to address them.
1. #### Code 1:

    ```python
    import pandas as pd

    # Load the dataset
    df = pd.read_csv('data.csv')

    # Calculate mean and median of a column
    mean_value = df['column'].mean()
    median_value = df['column'].median()

    # Print the results
    print(f"Mean: {mean_value}, Median: {median_value}")
    ```

2. #### Code 2:
   
    ```python
    import pandas as pd

    # Load the dataset
    data = pd.read_csv('data.csv')

    # Filter the data
    f = data[data['column'] > 10]

    # Calculate the sum
    s = f['column'].sum()

    # Print the result
    print(f"Sum: {s}")
    ```    


### Part Two:
You are required to create a Python package that includes the refactored code from the two code snippets provided above. Ensure that the package is well-documented and well-structured, and that it does not rely on fixed file name strings, fixed dataframe column names, or fixed parameters. Additionally, test the installation of this package using an offline installation method.
The results of the offline installation test should be demonstrated here in a Jupyter notebook cell.

In [None]:
### Your Solution here.

# Final Remarks:

1. Zip the entire folder that contains your Jupyter notebook and Git snapshots, as well as your Streamlit folder, into one file named after your esi-sba email address.
2. Send the compressed file to b.khaldi@esi-sba.dz