# Big Data.

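Big data refers to extremely large and complex sets of information that grow rapidly over time. It was originally characterized by three main features, the “Three V’s” (volume, velocity, and variety); veracity and value are commonly added, giving the “Five V’s” listed here: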

- **Volume** : The sheer amount of data.
There are exabytes, zettabytes, and yottabytes of data! What are these?
<br/>
One gigabyte is roughly equivalent to one billion bytes.
<br/>
One exabyte is one billion gigabytes.
<br/>
One zettabyte is approximately equal to one thousand exabytes. 
<br/>
One yottabyte is one thousand zettabytes.
<br/>
The growing number of data sources drives the volume of data. Think about it: as of 2022, the world population is nearly eight billion people, and the majority of people are using digital devices. These mobile devices, computers, phones, game consoles, and tablets all generate, capture, and store data.
Companies need to use tools and storage that enable them to handle such large volumes of data.

- **Velocity** : The speed at which new data is generated and processed.

- **Variety** : The various types of data, including structured and unstructured.

- **Veracity** : This refers to the quality and trustworthiness of data.

- **Value** : This refers to the ability to turn data into value. The main reason people invest time in understanding big data is to derive value from it.
Value isn’t just profit. It might be medical or social benefits, or it might be customer, employee, or personal satisfaction. Companies must make a case and have a clear understanding of the value they want to obtain from collecting and using big data. They must filter out the “noisy” data to find what they are looking for.
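
The unit sizes listed under Volume can be checked with a little arithmetic. The following is a minimal sketch, assuming decimal (SI) units in which one gigabyte is exactly one billion bytes; the `UNITS` table and the `convert` helper are illustrative names, not part of any library.

```python
# Decimal (SI) byte units, matching the sizes listed under Volume.
UNITS = {
    "gigabyte": 10**9,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "yottabyte": 10**24,
}


def convert(amount, from_unit, to_unit):
    """Convert an amount of data from one unit to another."""
    return amount * UNITS[from_unit] / UNITS[to_unit]


print(convert(1, "exabyte", "gigabyte"))     # gigabytes in one exabyte
print(convert(1, "zettabyte", "exabyte"))    # exabytes in one zettabyte
print(convert(1, "yottabyte", "zettabyte"))  # zettabytes in one yottabyte
```
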




Big data is used in various fields to analyze trends, make predictions, and solve complex problems.

# Data Analytics.

There are four recognized types of data analytics:

- *Descriptive analytics*
- *Diagnostic analytics*
- *Predictive analytics*
- *Prescriptive analytics*

Each type of data analytics has a different goal and a different place in the data analysis process, and answers a specific question. Take a moment to view the following diagram, which clarifies the degree of complexity and added-value contribution for each of the four data analytics types. 


<img src = '../images/data-analytic-complexity.png'>

As depicted in the previous diagram, the degree of difficulty and resources required increases for each type of data analytics. At the same time, the level of added insight and value also increases.

### Descriptive analytics: What is happening?
Descriptive analytics is the simplest and most common type of data analytics.

Descriptive analytics answers the question, “What is happening?” It provides a snapshot of business trends and patterns and uses historical and current data.

Descriptive analytics manipulates raw data from multiple sources to give a data analyst valuable insights into the past and a view of key metrics within a business.

These findings might signal that something is right or wrong but not explain why. However, the findings can help to determine what the biggest issues are and where to start investigating!
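
As a rough illustration of a descriptive "snapshot", the sketch below summarizes a small list of numbers with a few key metrics. The `daily_sales` values are hypothetical, and `describe` is an illustrative helper built on Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical daily sales figures (units sold); not from any real data set.
daily_sales = [120, 135, 128, 90, 142, 138, 131]


def describe(values):
    """Return a small snapshot of key metrics for a list of numbers."""
    return {
        "count": len(values),
        "total": sum(values),
        "mean": round(mean(values), 2),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


summary = describe(daily_sales)
print(summary)
```

The summary shows that one day (90 units) sits well below the rest, but it does not say why. That question belongs to diagnostic analytics.
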

### Diagnostic analytics: Why is it happening?
After asking the question, “What is happening?”, the next step is to dive deeper and ask “why?”, such as, “Why are trends and patterns happening?” This is where diagnostic analytics comes in.

Diagnostic analytics takes the insights found from descriptive analytics and drills down to find the causes of specific problems.

Businesses use diagnostic analytics because it creates more connections between data and identifies patterns of behavior.

Here are some examples of diagnostic analytics:

- A freight company investigates the cause of slow shipments in a certain region.
- A healthcare company examines diagnoses and prescribed medications to identify the influence of medications.
- An IT company analyzes server ticket data to identify a small number of servers causing the bulk of an organization’s service outages.
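
The IT example above can be sketched as a simple frequency drill-down. This is a minimal sketch under stated assumptions: the ticket list is hypothetical, and `top_contributors` with its 80% `threshold` is an illustrative choice rather than a standard method.

```python
from collections import Counter

# Hypothetical outage tickets, one entry per ticket, naming the affected server.
tickets = ["srv-03", "srv-07", "srv-03", "srv-01", "srv-03",
           "srv-07", "srv-03", "srv-09", "srv-07", "srv-03"]


def top_contributors(items, threshold=0.8):
    """Return the most frequent items whose combined share first reaches the threshold."""
    counts = Counter(items)
    total = sum(counts.values())
    selected, covered = [], 0
    for item, count in counts.most_common():
        selected.append((item, count))
        covered += count
        if covered / total >= threshold:
            break
    return selected


culprits = top_contributors(tickets)
print(culprits)
```

With this sample data, two of the four servers account for eight of the ten tickets, which tells the analyst where to investigate first.
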


### Predictive analytics: What is likely to happen in the future?
Predictive analytics is about forecasting. This type of analytics uses historical data to make predictions about the future. Whether the goal is estimating the likelihood of a future event, forecasting a quantifiable amount, or estimating when something might happen, all of these are done through predictive models.

In a world of great uncertainty, being able to predict allows businesses to make better decisions.

This type of analytics is more advanced and can often depend on machine learning and deep learning.
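
One of the simplest predictive models is a straight line fitted to historical data. The sketch below uses ordinary least squares to forecast the next value. The `months` and `revenue` figures are hypothetical, and a real project would likely use a dedicated library and a more careful model.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly revenue for months 1 through 5.
months = [1, 2, 3, 4, 5]
revenue = [100, 112, 119, 131, 138]

slope, intercept = fit_line(months, revenue)
forecast_month_6 = slope * 6 + intercept
print(slope, intercept, forecast_month_6)
```
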

### Prescriptive analytics: What should happen?
Prescriptive analytics combines the insight from all previous data analyses to determine a course of action to take to address a problem or make a decision.

The purpose of prescriptive analytics is to prescribe what action to take to eliminate a future problem or take full advantage of a promising trend.

Prescriptive analytics is typically used for a host of actions, rather than an individual action. This requires a major commitment from businesses to put forth the strategy, effort, and resources. As technology continues to improve and more professionals are educated in data, more companies will enter this data-driven realm.

Prescriptive analytics uses advanced tools and technologies, like machine learning, business rules, and algorithms. This makes prescriptive analytics sophisticated to implement and manage.

`Note that Prescriptive analytics recommends actions to take to eliminate a future problem or take advantage of a promising trend.`
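
To make the "business rules" idea concrete, the sketch below maps a demand forecast to a recommended action. Everything here is hypothetical: the `recommend_action` function, the 10% safety margin, and the rule that stock above twice the target should be discounted are assumptions chosen only to illustrate the pattern.

```python
import math


def recommend_action(forecast_demand, current_stock, safety_margin_pct=10):
    """Turn a demand forecast into an action using simple, hypothetical rules."""
    # Target stock is the forecast plus a safety margin, rounded up to whole units.
    target = forecast_demand + math.ceil(forecast_demand * safety_margin_pct / 100)
    if current_stock < target:
        return ("reorder", target - current_stock)
    if current_stock > 2 * target:
        return ("discount", current_stock - target)
    return ("hold", 0)


print(recommend_action(1000, 800))   # stock below target
print(recommend_action(1000, 2500))  # stock far above target
print(recommend_action(1000, 1200))  # stock within the acceptable range
```

In practice the forecast would come from a predictive model, and the rules would be agreed with the business rather than hard-coded.
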

<blockquote>
Note that the four steps to follow in the Data Analytics process are Collection, Cleaning, Analyzing, and Visualizing. 
So maybe CCAV? 
</blockquote>

## Data Analysis - Definition : 
Data analysis is the process of collecting, cleaning, and transforming data to obtain insights to help make better and informed decisions. In our ever-growing, data-driven world, this is a must for companies of all sizes to solve everyday business problems. Each company has its own team, processes, and tools for data analysis projects.

# A walkthrough of the CCAV process by an actual Data Analyst.

### Collect.
<blockquote>
“This is ‘square one’ in the process. This step is all about collecting the right data and just enough data for the project’s questions or problems that we want to research.

I first determine the data I can collect from any existing sources and databases that we already have that relate to the problem my company wants to solve. I always collect this data first! 

Then, I figure out if my project needs new sources of data because this could mean more time for the project and potentially more of an investment from my business group. 

My team and I use our company’s data collection tools and follow the data collection guidelines. We’re careful to securely store the data on our cloud servers, too.  

One crucial point I’d like to make is you have to collect enough in your data set, so you don’t skew the results of your analysis.”
</blockquote>

### Clean.
<blockquote>
“Next, not all the data I collect will be useful, so it’s time to clean it up!  

Data cleaning is the process of detecting and correcting missing or inaccurate records from a data set. 

A big part of this step is making sure that the data is in a usable format. This involves searching for what we call ‘outliers,’ dealing with null values, and looking for data that may have been incorrectly input. Simply put, raw data will have missing and inaccurate values that I need to address. 

You might have heard the term, ‘data wrangling’. I ‘wrangle’ the data so it’s in a usable format for my project in our database system. For example, I will search for duplicate records and remove them. 

No two data sets are the same, so how I clean the data can vary. I clean the data based on the context. In one case, seeing a blank entry might equal a zero entry, so it’s good and valuable data. But, in another case, seeing a blank entry could mean it’s incomplete data that I need to exclude. This is the art of data science!

Always save your data since this is an iterative process!

Oh, and here’s an interesting fact. This is where I spend most of my time, cleaning the data! I’d estimate that data analysts typically spend about 70-80% of their time cleaning data. It’s a lot of hard work. But, it’s a must so I can move on to analysis.”
</blockquote>
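
The cleaning steps the analyst describes (removing duplicates, handling blank values, and dropping outliers) can be sketched as follows. The records, the `clean` helper, and the fixed `max_amount` outlier cutoff are all hypothetical. The `blank_means_zero` flag reflects the point that a blank entry may mean zero in one context and incomplete data in another.

```python
# Hypothetical raw records as (order_id, amount); None marks a blank amount.
raw = [
    (1, 25.0),
    (2, None),
    (3, 30.0),
    (3, 30.0),    # duplicate record
    (4, 27.5),
    (5, 9999.0),  # likely an input error
]


def clean(records, max_amount=1000.0, blank_means_zero=False):
    """Drop duplicates, handle blank amounts, and exclude crude outliers."""
    seen = set()
    cleaned = []
    for order_id, amount in records:
        if order_id in seen:
            continue  # keep only the first record for each order
        seen.add(order_id)
        if amount is None:
            if not blank_means_zero:
                continue  # treat blank as incomplete data and exclude it
            amount = 0.0  # treat blank as a valid zero entry
        if amount > max_amount:
            continue  # simple cutoff; real outlier checks depend on context
        cleaned.append((order_id, amount))
    return cleaned


print(clean(raw))
print(clean(raw, blank_means_zero=True))
```
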

### Analyze.
<blockquote>
“Once I have the relevant data and it’s cleansed, it’s time to analyze. This is the step where data analysts spend about 20-30% of their time. It’s the fun and rewarding part!

I get to be curious and investigate. And, my problem-solving skills come into play. Here, I use different statistical and analytical methods and software tools. It’s important that I align my methods of analytics, so they match the intent of the problem.

Basically, I identify issues and use analytics to determine the root causes of issues. I analyze trends, correlations, variations, and outliers to help me focus on answering the questions (and any questions or objections others might have).

As I manipulate data, I might find I have the exact data I need, but, more likely, I might need to revise my original questions or collect more data. This can drive additional analysis and is one reason to always save your data.”
</blockquote>

### Visualize.


## Extract Transform and Load (ETL).
<blockquote>
You may hear the term “ETL” used in computer-based work environments, in relation to data, data warehousing, and analytics. ETL is an acronym for extract, transform, and load.

ETL is a data integration process that combines data from multiple data sources into a single, consistent data store that is loaded into a data warehouse or other target system.

As databases grew in popularity in the 1970s, ETL was introduced as a process for integrating and loading data for computation and analysis, eventually becoming the primary method to process data for data warehousing projects.
</blockquote>

ETL provides the foundation for data analytics and machine learning workstreams. Organizations often use ETL to:

- Extract data from legacy systems.
- Cleanse the data to improve data quality and establish consistency.
- Load data into a target database.
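
The three steps above can be sketched end to end with Python's standard library. This is a minimal sketch, not a production pipeline: the CSV text stands in for a legacy system export, an in-memory SQLite database stands in for the target, and the `sales` table and all field names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical export from a legacy system; names and values are made up.
LEGACY_CSV = """customer,country,amount
 Alice ,us,10.50
Bob,US,
alice,us,4.50
"""


def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount and make names and codes consistent."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append((
            row["customer"].strip().title(),
            row["country"].strip().upper(),
            float(row["amount"]),
        ))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(LEGACY_CSV)), conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Because the transform step normalizes ` Alice ` and `alice` to the same name, the two purchases are summed under one customer once loaded.
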


# Data Visualization.

<img src = '../images/Data Visualization.png'>

<blockquote>
A data visualization is a graphical display of abstract or complex information.
</blockquote>


Data analysts use visualizations like charts, graphs, and maps for two reasons:

- To explore and interpret data during analysis to identify patterns or trends.
- To communicate results and help people understand the insights to make decisions.

**Data storytelling** is the process of converting data analyses into a simple, understandable story to influence a business decision. With the rise of digital business and data-driven decision making, data storytelling is an important skill. The idea is to “connect the dots” between the results and decision makers, who must be able to interpret the data.

Data storytelling involves a combination of *data*, *visualizations*, and *narrative*.

- When narrative is coupled with data, it explains to the audience what is happening in the data and why an insight is important.
- When visualizations are applied to data, they enlighten an audience with insights that they might not obtain without charts or graphs. Patterns and trends emerge from all the rows and columns in a database, with the help of data visualizations.
- When narrative and visualizations come together, they can create a data story that can influence, drive change, and engage an audience.

Data analysts use specific charts to visualize quantitative and qualitative data. The following image contains common charts for visualizing these two types of data. `Conceptual charts can show either quantitative or qualitative data.` Take a moment to study them.

<img src = '../images/Types_of_charts.png'>

### Types of data comparison and recommended graphs. 

**Relative proportion** is the amount or quantity of a subset present in the population of all data points. For example, if there are 25 students in a class of which 15 are girls and 10 are boys, then the proportion of girls is 15 out of 25 (3 of 5) and the proportion of boys is 10 out of 25 (2 of 5).

**Ranking** is the relationship between a set of items, in which one item is ranked higher, lower, or the same, compared to a second item. For example, video game players can be ranked in order by their highest score in a tournament.

**Time** is a series of data points listed in time order, for example, the daily times of high tide and low tide at a beach.

**Frequency** is the number of times a certain event occurs. For example, if it snows two times today, then the frequency of snow on this particular day is 2.

A **correlation** is the relationship between two random variables, which are typically related in a linear way. For example, there is a correlation between the height of parents and their children.
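
Several of these comparisons can be computed directly. The sketch below reuses the class example for relative proportion; the tournament scores, weather events, and height figures are hypothetical, and `pearson` is an illustrative implementation of the Pearson correlation coefficient, one common measure of linear correlation.

```python
from collections import Counter
from math import sqrt

# Relative proportion: the class of 25 students (15 girls, 10 boys).
students = ["girl"] * 15 + ["boy"] * 10
proportions = {
    group: count / len(students) for group, count in Counter(students).items()
}

# Ranking: hypothetical tournament high scores, highest first.
scores = {"ana": 9200, "ben": 8700, "cy": 9900}
ranking = sorted(scores, key=scores.get, reverse=True)

# Frequency: how many times each event occurred in a hypothetical day's log.
events = ["snow", "rain", "snow"]
frequency = Counter(events)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Correlation: hypothetical parent and child heights in centimeters.
parent_heights = [160, 165, 170, 175, 180]
child_heights = [162, 166, 171, 174, 182]
r = pearson(parent_heights, child_heights)

print(proportions, ranking, frequency["snow"], round(r, 3))
```
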

<img src = '../images/Chart_recommendations.png'>

The goal of data visualization is to produce a visualization that is effective, attractive, and impactful.