# **Dataset Research**

## **Part 1: Identifying and Defining**
### **The Scenario and Purpose**
**Data:** My chosen scenario is a dataset of the Hollywood movies with the highest worldwide box office gross. Each movie is described by fields such as its name, budget, distributor, worldwide sales, genre and release date.

**Goal:** My goal is to use the dataset to identify and learn about the top 100 highest-grossing Hollywood movies of all time.

**Access:** The dataset is published on Kaggle as a public dataset, meaning anyone can view, discuss and download it.

**Access Method:** The dataset can be downloaded on Kaggle as a CSV file, and will be analyzed in the CSV file format.

**Source:** https://www.kaggle.com/datasets/sanjeetsinghnaik/top-1000-highest-grossing-movies/data
### **Functional Requirements**
#### **Data Loading**
In terms of data loading, the dataset must be in a format suitable for analysis. It must be downloaded as, or converted into, an appropriate file format such as CSV, which is widely supported by data analysis tools.
* **Input:** The user will need to input an appropriate file format such as .CSV
* **Output:** No analysis output is produced at this stage; the loaded data is held in a dataframe, ready for cleaning.
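
As a minimal sketch of this loading step, the snippet below reads CSV text into a Pandas dataframe. The column names follow the data dictionary in this document and may differ from the headers in the actual Kaggle file; `SAMPLE_CSV` and `load_movies` are hypothetical names used only for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the downloaded Kaggle CSV.
# Column names are taken from this document's data dictionary (an assumption).
SAMPLE_CSV = """Title,Year,Distributors,Budget (USD),Worldwide Sales (USD),Genres
Taken,2008,EuropaCorp,35000000,226800000,"Action, Thriller"
"""


def load_movies(source):
    """Read a CSV file path or file-like object into a dataframe."""
    return pd.read_csv(source)


movies = load_movies(io.StringIO(SAMPLE_CSV))
print(movies.head())
```

With the real download, `load_movies` would be given the path of the CSV file instead of the in-memory sample.
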
#### **Data Cleaning**
In terms of data cleaning, the system will need to be able to fix or remove incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within the dataset. To achieve this, the system will need to identify data discrepancies using data observability tools, remove unnecessary values and duplicate data, fix structural errors, address any missing values, and standardize data entry and formatting, all guided by a data quality strategy.
* **Input:** The user will need to filter data using either Pandas or Microsoft Excel.
* **Output:** If done correctly, a neat and processed dataset should be outputted by the program used.
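
The cleaning steps above could be sketched in Pandas roughly as follows. The rows are placeholders rather than real box office figures, and `clean_movies` is a hypothetical helper; a real cleaning pass would likely cover more columns and rules.

```python
import pandas as pd

# Placeholder rows (not real figures) containing stray whitespace,
# a duplicate row and a missing value.
raw = pd.DataFrame({
    "Title": [" Movie A", "Movie B", "Movie B", "Movie C"],
    "Year": [2001, 2005, 2005, 2010],
    "Worldwide Sales (USD)": [500, 300, 300, None],
})


def clean_movies(df):
    """Return a copy with tidy titles, no duplicate rows and no missing sales."""
    cleaned = df.copy()
    cleaned["Title"] = cleaned["Title"].str.strip()   # standardize formatting
    cleaned = cleaned.drop_duplicates()               # remove duplicate rows
    cleaned = cleaned.dropna(subset=["Worldwide Sales (USD)"])  # drop incomplete rows
    # The missing value forced the column to float; restore whole numbers.
    cleaned["Worldwide Sales (USD)"] = cleaned["Worldwide Sales (USD)"].astype("int64")
    return cleaned.reset_index(drop=True)


cleaned = clean_movies(raw)
print(cleaned)
```
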
#### **Data Analysis**
In terms of data analysis, the system will need to take raw data and convert it into information that is useful for decision-making by users. Data will be collected and analyzed to answer questions, test hypotheses, or disprove theories.
* **Input:** The user will need to run a program capable of processing and analyzing data accurately.
* **Output:** If done correctly, the used program will output the necessary analysis results requested by the user.
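
One analysis that fits the goal of this project is ranking movies by worldwide sales. The sketch below assumes a cleaned dataframe with the column names from the data dictionary; the rows are placeholders and `top_grossing` is a hypothetical helper.

```python
import pandas as pd

# Placeholder rows, not real box office figures.
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Worldwide Sales (USD)": [500, 900, 300, 700],
})


def top_grossing(df, n=100):
    """Return the n rows with the highest worldwide sales, highest first."""
    return df.nlargest(n, "Worldwide Sales (USD)").reset_index(drop=True)


top_two = top_grossing(movies, n=2)
print(top_two)
```

Leaving `n` at its default of 100 would give the top 100 list described in the goal, provided the dataset holds at least that many rows.
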
#### **Data Visualization**
In terms of data visualization, the data from the dataset will need to be represented visually through the use of common graphics, such as charts, plots, infographics and even animations.
* **Input:** The user will first need to decide which type of graphic should represent the data, then specify that type through the Matplotlib module.
* **Output:** The system will need to output the graphics requested by the user. The graphics must present the data correctly and clearly.
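
As an illustration of the visualization requirement, the following sketch draws a horizontal bar chart with Matplotlib. The chart type, the placeholder rows and the helper name `plot_top_movies` are assumptions made for the example, not decisions fixed by this document.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder rows, not real box office figures.
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Worldwide Sales (USD)": [500, 900, 300, 700],
})


def plot_top_movies(df, n=3):
    """Draw a horizontal bar chart of the n highest-grossing movies."""
    top = df.nlargest(n, "Worldwide Sales (USD)")
    fig, ax = plt.subplots()
    ax.barh(top["Title"], top["Worldwide Sales (USD)"])
    ax.invert_yaxis()  # place the highest-grossing movie at the top
    ax.set_xlabel("Worldwide sales (USD)")
    ax.set_title(f"Top {n} movies by worldwide sales")
    fig.tight_layout()
    return fig, ax


fig, ax = plot_top_movies(movies)
```

Calling `fig.savefig("top_movies.png")` afterwards would write the chart to an image file for use in a report.
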
#### **Data Reporting**
In terms of data reporting, the system will need to collect unprocessed data from different sources that can later be organized into meaningful pieces of information. These pieces of information can give valuable insights and statistics about the topic of the collected data.
* **Input:** The user will need to request statistical information from the program.
* **Output:** The system will need to output the statistical information the user requested in a fully valid form with zero errors.
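
A simple form of reporting is a set of summary statistics. The sketch below assumes a cleaned dataframe with the data dictionary's column names; the rows are placeholders and `build_report` is a hypothetical helper.

```python
import pandas as pd

# Placeholder rows, not real box office figures.
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Year": [2001, 2005, 2005, 2010],
    "Worldwide Sales (USD)": [500, 900, 300, 700],
})


def build_report(df):
    """Collect a few summary statistics into a plain dictionary."""
    sales = df["Worldwide Sales (USD)"]
    return {
        "movie_count": int(len(df)),
        "total_sales": int(sales.sum()),
        "mean_sales": float(sales.mean()),
        "highest_grossing": df.loc[sales.idxmax(), "Title"],
    }


report = build_report(movies)
for key, value in report.items():
    print(f"{key}: {value}")
```
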
### **Use Cases**
#### **Data Loading**
**Actor:** User

**Goal:** To import the correct file format as a dataset into the system.

**Preconditions:** User has a dataset ready to be imported.
> Step 1: User organizes the files and places the dataset into the appropriate folder to be read.
>
> Step 2: System scans and reads through the dataset, then validates the dataset after approval.
>
> Step 3: After validation, the system displays the data in a dataframe.

**Postconditions:** The dataset is ready for cleaning.
#### **Data Cleaning**
**Actor:** User

**Goal:** To fix or remove incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

**Preconditions:** The system has a loaded dataset to clean.
> Step 1: The user runs a program built to filter out invalid data efficiently.
>
> Step 2: The program reads the dataset. If any invalid data is detected, the program filters it out.
>
> Step 3: The program presents the cleaned dataset as a valid dataframe.

**Postconditions:** The dataset is filtered, cleaned and ready for the next stages of analysis.
#### **Data Analysis**
**Actor:** User

**Goal:** To analyze the data, extract its most important information and summarize it.

**Preconditions:** The dataset has been thoroughly filtered and cleaned.
> Step 1: Prior to producing the analysis, the user selects the analysis method.
>
> Step 2: The user runs a program built to analyze data efficiently.
>
> Step 3: The program performs the analysis and presents the results in an organized format.

**Postconditions:** The user has access to the analysis results.
#### **Data Visualization**
**Actor:** User

**Goal:** To create visual representations of the data.

**Preconditions:** The dataset has been analyzed and the insights are defined.
> Step 1: Prior to producing the visualization, the user chooses a visualization type they want for the data.
>
> Step 2: User runs a program built to visualize data in an organized format.
>
> Step 3: The program reads the requests and generates the wanted visualization.

**Postconditions:** Visualizations are created and ready for inclusion in reports.
#### **Data Reporting**
**Actor:** User

**Goal:** To generate a report summarizing the data analysis and visualizations.

**Preconditions:** Data analysis and visualizations are fully completed and prepared.
> Step 1: User selects report content and format.
>
> Step 2: User runs a program built to generate data reports.
>
> Step 3: The program compiles the data and generates the report.

**Postconditions:** The report is generated and ready for distribution or presentation.

## **Part 2: Researching and Planning**
### **Research of Issue and Security Requirements**

The purpose of my project is to find out the most popular Hollywood movies in the world. This project needs to be conducted as 

#### **What should an application have to be secure?**
An application requires many crucial security factors in order to be secure. The most important are proper, reliable user authentication and authorization systems, which protect against unauthorised access by ensuring that only authorised individuals or systems can access the data.

Other important security practices for developers are input validation, data storage, and secure encryption. Input validation ensures that only properly formed data enters the workflow of an information system, which prevents malformed data from persisting in the database. Data storage systems keep data and make it easily accessible. Encryption is extremely important as it plays a major role in securing data from attackers and other cybersecurity threats.

Aside from those, developers should also use secure protocols and APIs, implement security testing, handle and log errors properly, and maintain the application frequently.
#### **Terminology**
##### **User Authentication:** A process built in computer systems by developers that verifies a person's identity, allowing them access to an online service, connected device, or other resource.
##### **Password Hashing:** A security process that turns a password into a fixed-length string of letters and numbers using a one-way hashing algorithm. Because the original password cannot be recovered from the hash, this process makes it much harder for cybercriminals to obtain people's passwords if a website is hacked.
##### **Encryption:** A process that scrambles, converts and transforms data into incomprehensible text known as ciphertext, so that only authorized parties can understand and decode the information. Encryption plays a major role in data privacy.
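
To illustrate password hashing as defined above, here is a minimal sketch using only Python's standard library (`hashlib.pbkdf2_hmac` with a random salt). The iteration count and the function names `hash_password` and `verify_password` are assumptions chosen for the example; a real application would follow current guidance when picking the algorithm and work factor.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor for this example only


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using salted PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for each password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, stored = hash_password("example-password")
print(verify_password("example-password", salt, stored))
print(verify_password("wrong-password", salt, stored))
```

Only the salt and the digest are stored. Unlike encryption, there is no key that turns the digest back into the password, so a login attempt is checked by hashing it again and comparing the results.
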
### **Data Dictionaries**
| Field | Data Type | Display Format | Description | Example | Validation |
|:- | :- | :- | :- | :- | :- |
| Title | object | XX...XX | The title of the movie. | Taken | Can include any characters and be of any length. |
| Year | int64 | YYYY | The year the movie was released in. | 2008 | Must be a 4-digit number. |
| Distributors | object | XX...XX, ... , XX...XX | The companies responsible for the distribution of the movie. | EuropaCorp, FilmFlex, 20th Century Studios | Can contain all letters, symbols, and numbers. |
| Budget (USD) | int64 | NNNNNNNNN | The budget of the movie in USD. | 35000000 | Can contain up to 9 digits and must not contain any letters. |
| Worldwide Sales (USD) | int64 | NNNNNNNNNN | The worldwide box office gross of the movie in USD. | 226800000 | Can contain up to 10 digits and must not contain any letters. |
| Genres | object | XX...XX, ... , XX...XX | The genre/s of the movie. | Action, Thriller, Mystery, Drama | Can contain letters and symbols but must not contain any numbers. |
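
The numeric validation rules in the table above could be checked along the following lines. This is a sketch under the assumption that "4-digit" means 1000 to 9999 and that budgets and sales are non-negative; `find_invalid_rows` is a hypothetical helper, and the second row is a deliberately invalid placeholder.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Taken", "Movie X"],
    "Year": [2008, 208],  # 208 is not a 4-digit year
    "Budget (USD)": [35000000, 1000],
    "Worldwide Sales (USD)": [226800000, 5000],
})


def find_invalid_rows(df):
    """Return the rows that break any numeric rule from the data dictionary."""
    year_ok = df["Year"].between(1000, 9999)                         # 4 digits
    budget_ok = df["Budget (USD)"].between(0, 999_999_999)           # up to 9 digits
    sales_ok = df["Worldwide Sales (USD)"].between(0, 9_999_999_999)  # up to 10 digits
    return df[~(year_ok & budget_ok & sales_ok)]


invalid = find_invalid_rows(movies)
print(invalid)
```
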