# Data Wrangling

### What Is Data Wrangling?

Data wrangling, also known as data munging and sometimes loosely called data cleaning, is the process of transforming and mapping raw data into a format suitable for analysis. It involves cleaning and structuring the raw data so that it is organized and easier to work with.

#### Why does it matter?

* Data wrangling lets you gather data from multiple sources into one central location.
* Cleaning and converting data into a standard format lets you run analyses across data sets.
* Data wrangling prepares data by removing flawed entries and handling missing ones, readying it for data mining and helping businesses make concrete, data-driven decisions.

### Common Steps Involved in Data Wrangling

* Discovery: The first step helps you make sense of the data you're working with. Keep the primary goal of the analysis in mind during this step.
* Structuring: Transform the raw data into a form appropriate for the analytical model you want to use to interpret it.
* Cleaning & Transformation: This step covers tasks such as standardising inputs, deleting empty cells, removing outliers, and deleting blank rows. The goal is to make the data as error-free as possible.
* Enriching: Once the data is in a more usable state, determine whether you have all the data you need for the project. If not, enrich it by adding values from other data sets.
* Validation: Check the data for consistency and quality. You might find issues that need addressing, or confirm that the data is ready to be analysed.
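
Of the steps above, enriching is perhaps the easiest to picture in code. The following is a minimal sketch in plain Python (not PySpark, so it runs without a Spark session). The `sections` lookup, the `Section` field, and the `enrich` helper are hypothetical names introduced only for illustration.

```python
# Plain-Python sketch of the "Enriching" step: add a field from a second data set.
# The "Section" lookup is hypothetical and exists only for illustration.

students = [
    {"Name": "Aadi", "Age": 17},
    {"Name": "Deeksha", "Age": 17},
    {"Name": "Jincy", "Age": 18},
]

# Second (hypothetical) data set, keyed by student name.
sections = {"Aadi": "A", "Deeksha": "B"}


def enrich(records, lookup, new_field, default=None):
    """Return new records with `new_field` filled in from `lookup` by Name."""
    return [
        {**record, new_field: lookup.get(record["Name"], default)}
        for record in records
    ]


enriched = enrich(students, sections, "Section")

# Records without a match keep a None placeholder, so the gap stays visible
# for the later validation step.
unmatched = [r["Name"] for r in enriched if r["Section"] is None]

for row in enriched:
    print(row)
print("No section found for:", unmatched)
```

Keeping unmatched records with a placeholder, rather than dropping them, is a design choice: it leaves the decision about incomplete rows to the validation step.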

### Executing the Data Wrangling Steps in PySpark (Simple Hands-On Example)

In [65]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, avg, when

In [66]:
# Create (or reuse) a Spark session for this notebook
spark = SparkSession.builder.appName('Data_wrangling').getOrCreate()

In [67]:
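# Sample data set for the data wrangling walkthrough (self-generated).
# Every column is declared as a string here; "Age" and "Marks" are cast to integers later.
# Missing marks are stored as the string 'NaN'.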
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("Gender", StringType(), True),
    StructField("Marks", StringType(), True)
])
data = [('Aadi', 17, 'M', 90),
        ('Deeksha', 17, 'F', 76),
        ('Jincy', 18, 'F', 'NaN'),
        ('Keerthi', 17, 'F', 74),
        ('Harish', 18, 'M', 65),
        ('Anu', 17, 'F', 'NaN'),
        ('Ram', 17, 'M', 71)]

df_sample = spark.createDataFrame(data, schema=schema)

#### 1. Data Exploration
Start by inspecting the rows and the summary statistics to understand the data.

In [81]:
df_sample.show()

+-------+---+------+-----+
|   Name|Age|Gender|Marks|
+-------+---+------+-----+
|   Aadi| 17|     M|   90|
|Deeksha| 17|     F|   76|
|  Jincy| 18|     F|  NaN|
|Keerthi| 17|     F|   74|
| Harish| 18|     M|   65|
|    Anu| 17|     F|  NaN|
|    Ram| 17|     M|   71|
+-------+---+------+-----+



In [68]:
df_sample.summary().show()

+-------+----+-------------------+------+-----+
|summary|Name|                Age|Gender|Marks|
+-------+----+-------------------+------+-----+
|  count|   7|                  7|     7|    7|
|   mean|NULL| 17.285714285714285|  NULL|  NaN|
| stddev|NULL|0.48795003647426616|  NULL|  NaN|
|    min|Aadi|                 17|     F|   65|
|    25%|NULL|               17.0|  NULL| 71.0|
|    50%|NULL|               17.0|  NULL| 76.0|
|    75%|NULL|               18.0|  NULL| 90.0|
|    max| Ram|                 18|     M|  NaN|
+-------+----+-------------------+------+-----+



#### 2. Dealing with Missing Values
Two entries in the `Marks` column are missing and are stored as the string 'NaN', which is why the mean of `Marks` in the summary above is NaN.

Missing values can be handled in many ways, depending on the data set. Here they are replaced with the mean of the available marks.
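
As a rough illustration of how the choice of strategy changes the result, the sketch below compares dropping, mean filling, and median filling in plain Python (not PySpark, so it runs without a Spark session). The list `marks` mirrors the sample data, with `None` standing in for the 'NaN' entries. The median strategy is an alternative added here for comparison and is not used in this notebook.

```python
# Plain-Python sketch: three common ways to handle missing marks.
# The marks mirror the sample data set; None stands in for a missing entry.
from statistics import mean, median

marks = [90, 76, None, 74, 65, None, 71]

known = [m for m in marks if m is not None]

# Strategy 1: drop the records whose mark is missing.
dropped = known

# Strategy 2: replace missing marks with the mean of the known marks.
mean_mark = mean(known)
mean_filled = [mean_mark if m is None else m for m in marks]

# Strategy 3: replace missing marks with the median, which is less
# sensitive to extreme values than the mean.
median_mark = median(known)
median_filled = [median_mark if m is None else m for m in marks]

print("mean:", mean_mark, "median:", median_mark)
print("rows kept after dropping:", len(dropped))
```

Dropping shrinks the data set from seven records to five, while either fill keeps all seven at the cost of introducing estimated values.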

In [69]:
df_sample.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Marks: string (nullable = true)



In [82]:
# To compute the mean of the "Marks" column, cast it to integer first, because every column is stored as a string.
# In this run the string 'NaN' becomes null after the cast, and avg() ignores nulls,
# so the mean covers the five marks that are present.
df_casted_dtypes = (
    df_sample
    .withColumn("Marks", df_sample["Marks"].cast(IntegerType()))
    .withColumn("Age", df_sample["Age"].cast(IntegerType()))
)
avg_marks = df_casted_dtypes.select(avg(col('Marks'))).collect()[0][0]
print(f'Avg marks obtained {avg_marks}')

Avg marks obtained 75.2


In [83]:
df_casted_dtypes.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Marks: integer (nullable = true)



In [84]:
# Replace the 'NaN' entries in the "Marks" column with the mean computed above (avg_marks).
# This starts from df_sample, so "Marks" and "Age" remain string columns in the result.
df_null_replaced = df_sample.withColumn('Marks', when(col('Marks') == 'NaN', avg_marks).otherwise(col('Marks')))

In [85]:
df_null_replaced.show()

+-------+---+------+-----+
|   Name|Age|Gender|Marks|
+-------+---+------+-----+
|   Aadi| 17|     M|   90|
|Deeksha| 17|     F|   76|
|  Jincy| 18|     F| 75.2|
|Keerthi| 17|     F|   74|
| Harish| 18|     M|   65|
|    Anu| 17|     F| 75.2|
|    Ram| 17|     M|   71|
+-------+---+------+-----+



#### 3. Reshaping the Data
Categorical values can be represented numerically. The `Gender` column is categorical, so it is encoded here as a number: 1 for 'M' and 0 for every other value.
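
One thing to watch with a two-way encoding is what happens to values outside the expected categories. A rule of the form "1 if the value is 'M', otherwise 0" maps every other value, including typos and blanks, to 0. The plain-Python sketch below (not PySpark) shows an explicit mapping that surfaces unknown values instead. `GENDER_CODES` and `encode_gender` are illustrative names that do not appear in the notebook.

```python
# Plain-Python sketch: encode a categorical column with an explicit mapping.
# GENDER_CODES and encode_gender are illustrative names, not from the notebook.

GENDER_CODES = {"M": 1, "F": 0}


def encode_gender(value):
    """Return the numeric code for `value`, or None if it is not a known category."""
    return GENDER_CODES.get(value)


# Gender values in the same order as the sample data set.
genders = ["M", "F", "F", "F", "M", "F", "M"]
encoded = [encode_gender(g) for g in genders]
print(encoded)

# An unexpected value is surfaced as None instead of silently becoming 0.
print(encode_gender("unknown"))
```

For the sample data both approaches give the same codes, because the column contains only 'M' and 'F'.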

In [86]:
df_column_reshaped = df_null_replaced.withColumn('Gender', when(col('Gender') == 'M', 1).otherwise(0))

In [87]:
df_column_reshaped.show()

+-------+---+------+-----+
|   Name|Age|Gender|Marks|
+-------+---+------+-----+
|   Aadi| 17|     1|   90|
|Deeksha| 17|     0|   76|
|  Jincy| 18|     0| 75.2|
|Keerthi| 17|     0|   74|
| Harish| 18|     1|   65|
|    Anu| 17|     0| 75.2|
|    Ram| 17|     1|   71|
+-------+---+------+-----+



#### 4. Filtering
Here the table is reduced to the data of interest by removing unwanted rows and columns: only students with marks of at least 75 are kept, and the `Age` column is dropped.

In [88]:
Sample_filtered_df = df_column_reshaped.filter(df_column_reshaped['Marks'] >= 75).drop('Age')

In [89]:
Sample_filtered_df.show()

+-------+------+-----+
|   Name|Gender|Marks|
+-------+------+-----+
|   Aadi|     1|   90|
|Deeksha|     0|   76|
|  Jincy|     0| 75.2|
|    Anu|     0| 75.2|
+-------+------+-----+
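
The Validation step listed earlier is not run in this notebook. A minimal sketch of what such a pass might look like follows, in plain Python (not PySpark). The `validate` helper and its rules (Gender codes limited to 0 and 1, marks between 0 and 100) are assumptions made for illustration, not requirements stated by the data set.

```python
# Plain-Python sketch of a validation pass over wrangled records.
# The rules (allowed codes, 0-100 mark range) are assumptions for illustration.

def validate(records):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for i, row in enumerate(records):
        if not row.get("Name"):
            problems.append(f"row {i}: missing Name")
        if row.get("Gender") not in (0, 1):
            problems.append(f"row {i}: unexpected Gender code {row.get('Gender')!r}")
        try:
            marks = float(row.get("Marks"))
        except (TypeError, ValueError):
            problems.append(f"row {i}: Marks is not numeric: {row.get('Marks')!r}")
            continue
        # NaN fails every comparison, so this range check also catches it.
        if not 0 <= marks <= 100:
            problems.append(f"row {i}: Marks out of range: {marks}")
    return problems


# Rows shaped like the filtered output above (Marks kept as strings).
filtered = [
    {"Name": "Aadi", "Gender": 1, "Marks": "90"},
    {"Name": "Deeksha", "Gender": 0, "Marks": "76"},
    {"Name": "Jincy", "Gender": 0, "Marks": "75.2"},
    {"Name": "Anu", "Gender": 0, "Marks": "75.2"},
]

print(validate(filtered))
print(validate([{"Name": "X", "Gender": 0, "Marks": "NaN"}]))
```

Returning a list of problems rather than raising on the first one makes it possible to review every issue in a single pass.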



## CONCLUSION
This walkthrough shows how important data wrangling is and how strongly it can affect the rest of the analysis.
Good data is the foundation of data science: reliable results can only come from well-prepared data.
So wrangle the data before processing it for analysis.