# The field of Data Science and popular techniques for working with data

<hr>

## **Big Data**

<hr>

## **Techniques for Working with Big Data**

<br>

While many techniques for working with **big data** are conceptually similar to those used with traditional data, big data introduces **new challenges related to scale, speed, and variety**. As a result, additional techniques and technologies are required.

Just like traditional data, big data must still be **collected, processed, and transformed into usable information**. However, the nature of the data significantly changes *how* this is done.

<br>

<span style="color: lightgreen;">Raw Big Data → Distributed Processing → Information</span>

<br><hr style="display: flex; width: 50%; margin: auto;"><br>

## **1. Raw Big Data / Data Collection**

Raw big data refers to **large-scale, fast-moving, and diverse datasets** generated continuously from multiple sources.

### Characteristics of Raw Big Data
- Very large volume
- Generated at high speed (near real-time)
- Comes in many formats
- Often noisy and incomplete
- Stored across distributed systems

### Common Sources
- Social media platforms
- Sensor and IoT devices
- Financial market systems
- Log files and clickstream data
- Multimedia platforms (images, video, audio)

Raw big data **cannot be analysed directly** and must first undergo distributed processing.

---

## **2. Data Processing (Big Data Pre-processing)**

Because of its scale and complexity, big data processing is usually performed across **multiple machines** using distributed frameworks.

### **2.1 Extended Class Labelling**
With traditional data, class labelling focuses mainly on:
- numerical data
- categorical data

With big data, class labelling expands to include:

- **Numerical data**
- **Text data**
- **Digital images**
- **Digital video**
- **Digital audio**
- **Semi-structured data** (JSON, XML)
- **Unstructured data**

Each data type requires **specialised processing techniques**.

---

### **2.2 Data Cleansing at Scale**
Because of its volume and variety, big data is harder to cleanse than traditional data.

Cleansing at scale includes:
- removing duplicates across distributed systems
- correcting inconsistencies
- noise reduction (especially for sensor, audio, and image data)
- filtering irrelevant or low-quality data

Because the datasets are too large to inspect by hand, cleansing is usually automated.
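As a rough illustration of automated cleansing, the sketch below removes duplicates, filters out-of-range readings, and normalises inconsistent labels in a single pass over a record stream. The field names (`id`, `sensor`, `temp_c`) and the valid temperature range are assumptions made for this example, and the in-memory `seen_ids` set stands in for what would be a partitioned or distributed structure on a real cluster.

```python
from typing import Dict, Iterable, Iterator


def clean_records(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield records with duplicates and invalid readings removed."""
    seen_ids = set()  # assumption: fits in memory; a cluster would partition this by key
    for rec in records:
        rec_id = rec.get("id")
        if rec_id is None or rec_id in seen_ids:
            continue  # drop records without a key, and repeated keys
        temp = rec.get("temp_c")
        if temp is None or not (-50.0 <= temp <= 60.0):
            continue  # filter noise: missing or physically implausible readings
        seen_ids.add(rec_id)
        yield {
            "id": rec_id,
            # correct inconsistencies in casing and whitespace
            "sensor": str(rec.get("sensor", "")).strip().lower(),
            "temp_c": float(temp),
        }


raw = [
    {"id": 1, "sensor": " A1 ", "temp_c": 21.5},
    {"id": 1, "sensor": "a1", "temp_c": 21.5},   # duplicate id
    {"id": 2, "sensor": "B2", "temp_c": 999.0},  # out-of-range reading
    {"id": 3, "sensor": "b2", "temp_c": 19},
]
cleaned = list(clean_records(raw))
```

Writing the cleaner as a generator means it processes one record at a time, which is the same shape a streaming or per-partition job would take.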

---

### **2.3 Handling Missing and Incomplete Data**
Missing values are more common in big data due to:
- sensor failures
- interrupted data streams
- incomplete user inputs

Techniques include:
- real-time imputation
- data stream buffering
- probabilistic estimation
- ignoring incomplete records at scale
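One simple form of real-time imputation is to replace a missing value with the running mean of everything seen so far. The sketch below assumes missing readings arrive as `None` and that leading gaps (before any value has been observed) are dropped; both are choices made for this example rather than fixed rules.

```python
from typing import Iterable, Iterator, Optional


def impute_stream(values: Iterable[Optional[float]]) -> Iterator[float]:
    """Fill gaps in a stream with the running mean of the observed values."""
    count = 0
    mean = 0.0
    for value in values:
        if value is None:
            if count == 0:
                continue  # nothing observed yet, so the record is dropped
            yield mean    # impute with the current running mean
        else:
            count += 1
            mean += (value - mean) / count  # incremental mean, no history stored
            yield value


readings = [None, 10.0, 20.0, None, 30.0]
filled = list(impute_stream(readings))  # [10.0, 20.0, 15.0, 30.0]
```

The incremental mean keeps only two numbers in memory, which matters when the stream is too large to buffer.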

---

### **2.4 Data Transformation**
Big data often needs to be transformed before analysis:
- text → tokens or embeddings
- images → pixel matrices or feature vectors
- audio → frequency or waveform features
- time-series alignment

Transformation allows unstructured data to become mathematically analysable.
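To make the text case concrete, the following sketch turns raw strings into tokens and then into fixed-length count vectors (a bag-of-words representation). The tokenisation rule and the helper names `tokenize`, `build_vocabulary`, and `vectorize` are illustrative assumptions; production pipelines typically use a dedicated NLP library.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(docs: List[str]) -> Dict[str, int]:
    """Assign each distinct token a column index, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for doc in docs:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Convert text into a count vector aligned with the vocabulary."""
    vector = [0] * len(vocab)
    for token, n in Counter(tokenize(text)).items():
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] = n
    return vector


docs = ["Big data is big", "Data moves fast"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]  # [[2, 1, 1, 0, 0], [0, 1, 0, 1, 1]]
```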

---

### **2.5 Distributed Storage and Processing**
Unlike traditional data, big data requires:
- distributed file systems
- parallel processing
- fault tolerance

Examples:
- data lakes, used alongside or instead of data warehouses
- distributed computing clusters
- batch and streaming processing

Distributed storage and processing is the **core technique that sets big data apart** from traditional data.
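The split-process-merge pattern behind parallel processing can be sketched on a single machine. In the example below, each list in `partitions` stands in for a block of data held on a different node, and a thread pool stands in for the cluster's workers; both are simplifying assumptions, and real frameworks add scheduling, data locality, and fault tolerance on top.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List


def map_partition(lines: List[str]) -> Counter:
    """Map step: count words within one partition, independently of the others."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial results into one."""
    left.update(right)
    return left


def parallel_count(partitions: List[List[str]], workers: int = 4) -> Counter:
    """Run the map step on every partition in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())


partitions = [
    ["error disk full", "info ok"],
    ["error timeout"],
    ["info ok", "error disk full"],
]
totals = parallel_count(partitions)
```

Because each partition is processed without reference to the others, a failed partition could be recomputed on its own, which is the basic idea behind fault tolerance in this style of processing.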

---

## **3. Additional Techniques**

### ✅ **Scalability**
Systems must automatically scale to handle:
- increasing data volume
- higher velocity
- more data sources

### ✅ **Real-Time & Stream Processing**
Big data is often processed:
- continuously
- in near real-time

This enables:
- live dashboards
- fraud detection
- real-time monitoring
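A minimal sketch of continuous processing, loosely modelled on a fraud check: each incoming amount is compared against the average of the most recent transactions and flagged if it is far larger. The window size, the threshold `factor`, and the rule itself are assumptions chosen for illustration, not a real fraud model.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def flag_spikes(
    amounts: Iterable[float], window: int = 3, factor: float = 3.0
) -> Iterator[Tuple[int, float]]:
    """Yield (position, amount) for amounts far above the recent average."""
    recent: deque = deque(maxlen=window)  # only the last `window` values are kept
    for position, amount in enumerate(amounts):
        if len(recent) == window:
            baseline = sum(recent) / window
            if amount > factor * baseline:
                yield position, amount
        recent.append(amount)


transactions = [10.0, 12.0, 11.0, 95.0, 10.0, 13.0]
alerts = list(flag_spikes(transactions))  # [(3, 95.0)]
```

Each event is handled as it arrives and only a small fixed-size window is retained, so the same logic works on an unbounded stream.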

### ✅ **Data Reduction at Scale**
To manage size and cost:
- sampling large streams
- summarising data
- windowing time-series data
- feature extraction
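Two of these reduction techniques are short enough to sketch directly. `reservoir_sample` keeps a uniform random sample of fixed size from a stream of unknown length (the classic reservoir sampling approach), and `tumbling_window_mean` summarises timestamped events into one average per fixed-width window. The function names, the `(timestamp, value)` event shape, and the fixed `seed` are assumptions for this example.

```python
import random
from typing import Dict, Iterable, List, Tuple


def reservoir_sample(stream: Iterable, k: int, seed: int = 0) -> List:
    """Keep a uniform random sample of k items without storing the stream."""
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    sample: List = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = item     # replace with probability k / (i + 1)
    return sample


def tumbling_window_mean(
    events: Iterable[Tuple[int, float]], width: int
) -> Dict[int, float]:
    """Average the values in each non-overlapping window of `width` seconds."""
    sums: Dict[int, Tuple[float, int]] = {}
    for timestamp, value in events:
        start = (timestamp // width) * width  # window this event falls into
        total, n = sums.get(start, (0.0, 0))
        sums[start] = (total + value, n + 1)
    return {start: total / n for start, (total, n) in sums.items()}


sample = reservoir_sample(range(1000), k=5)
events = [(0, 1.0), (5, 3.0), (10, 10.0), (19, 20.0), (20, 7.0)]
summary = tumbling_window_mean(events, width=10)  # {0: 2.0, 10: 15.0, 20: 7.0}
```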

---

## **4. Case-Specific Scenarios**

### **4.1 Text Data Mining**
- extracting keywords
- sentiment analysis
- topic modelling
- document clustering
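Keyword extraction, for instance, can be approximated with TF-IDF scoring: a word ranks highly in a document if it is frequent there but rare across the rest of the collection. The sketch below is a bare-bones version under stated assumptions (simple regex tokenisation, natural-log IDF, ties broken alphabetically); the sample documents are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs: List[str], n: int = 2) -> List[List[str]]:
    """Return the n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    total_docs = len(docs)

    results = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(total_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # highest score first; alphabetical order breaks ties deterministically
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:n])
    return results


docs = [
    "fraud alert fraud detected on card",
    "card payment approved",
    "card payment declined",
]
keywords = top_keywords(docs, n=1)  # [['fraud'], ['approved'], ['declined']]
```

Note that "card" appears in every document, so its score is zero and it never surfaces as a keyword.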

### **4.2 Confidentiality & Privacy**
Because big data often includes personal information, the following techniques are essential:
- data anonymisation / data masking
- privacy-preserving data mining
- access control
- encryption
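As a small illustration of masking, the sketch below replaces a user identifier with a keyed hash and hides most of an email address before a record is passed on for analysis. The record fields, the helper names, and the hard-coded `SECRET_KEY` are assumptions for this example; in practice the key would come from a secrets manager. Keyed hashing is pseudonymisation rather than full anonymisation, since anyone holding the key can link records back to the same person.

```python
import hashlib
import hmac

SECRET_KEY = b"example-key-not-for-production"  # assumption: placeholder key


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (stable per key)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not domain:
        return "*" * len(email)  # not a recognisable address, so mask it all
    return local[:1] + "*" * (len(local) - 1) + "@" + domain


record = {"user_id": "u-1001", "email": "alice@example.com", "amount": 42.5}
safe_record = {
    "user_id": pseudonymise(record["user_id"], SECRET_KEY),
    "email": mask_email(record["email"]),
    "amount": record["amount"],  # non-identifying fields pass through unchanged
}
```

Because the same input and key always produce the same digest, records for one user can still be joined and counted without exposing the original identifier.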

---

## **5. Real-Life Examples of Big Data**

### **Social Media Data**
- extremely high variety (text, images, video)
- massive volume
- high velocity (posts, likes, comments)

### **Financial Trading Data**
- recorded every second or millisecond
- massive transaction streams
- requires real-time analytics and storage

---

## **6. Outcome: Information**

After large-scale processing, big data is transformed into information used for:
- predictive analytics
- machine learning
- pattern detection
- real-time decision-making
- AI-driven systems

---

### **Key Difference from Traditional Data**
Traditional data techniques focus on **structure and accuracy**.  
Big data techniques focus on **scalability, automation, and distributed intelligence**.