# ENTITY RESOLUTION

### Description

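This code sets up PySpark for the notebook. It imports the required modules and points both `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` at the interpreter running the notebook (`sys.executable`), so the driver and the workers use the same Python.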

In [None]:
import pyspark
import os
import sys
from pyspark import SparkContext
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession

## Configuring Spark Session

Configuring a SparkSession named 'chapter_2' with specific driver memory settings.

- **Configuring Driver Memory**: `.config("spark.driver.memory", "16g")` sets the driver memory to 16 gigabytes.
- **Creating SparkSession**: `SparkSession.builder.appName('chapter_2').getOrCreate()` creates a SparkSession with the specified configuration and application name.

Configuring the SparkSession with appropriate memory settings is crucial for managing resources effectively and optimizing performance, especially for memory-intensive tasks.


In [None]:
spark = SparkSession.builder.config("spark.driver.memory", "16g").appName('chapter_2').getOrCreate()


### Setting up the data and analyzing it

## Reading CSV Data

Reading CSV data from the specified file path using SparkSession.

- **CSV File Path**: The CSV data is read from the file path `"data/linkage/donation/block_1/block_1.csv"`.
- **Using SparkSession**: `spark.read.csv()` is used to read CSV files into a DataFrame.
- **DataFrame Creation**: The resulting DataFrame contains the data from the CSV file.

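This operation loads the data from the CSV file into a DataFrame. Because no parsing options are given, Spark treats the header row as data, assigns default column names (`_c0`, `_c1`, ...), and reads every column as a string.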


In [None]:
prev = spark.read.csv("data/linkage/donation/block_1/block_1.csv")

prev

## Displaying Data

Displaying the first two rows of the DataFrame `prev`.

- **DataFrame Display**: `prev.show(2)` is used to show the first two rows of the DataFrame.

Displaying a subset of the data provides an overview of its structure and content, facilitating further exploration and analysis.


In [None]:
prev.show(2)

## Reading CSV Data with Parsing Options

Reading the same CSV file again with SparkSession, this time supplying parsing options.

- **CSV File Path**: The CSV data is read from the file path `"data/linkage/donation/block_1/block_1.csv"`.
- **Parsing Options**:
  - `header="true"`: Specifies that the first row contains the column names.
  - `nullValue="?"`: Specifies that `?` is treated as a null value during parsing.
  - `inferSchema="true"`: Enables automatic schema inference, which assigns each column a data type based on its values at the cost of an extra pass over the file.
- **DataFrame Creation**: The resulting DataFrame contains the parsed data from the CSV file.

Using parsing options ensures accurate interpretation of the CSV data, including handling of headers, null values, and automatic schema inference.


In [None]:
parsed = spark.read.option("header", "true").option("nullValue", "?").option("inferSchema", "true").csv("data/linkage/donation/block_1/block_1.csv")
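
To make the three options concrete, here is a minimal plain-Python sketch of their effect on a tiny CSV sample. It only imitates the behavior for illustration and is not how Spark implements it: the sample text and the `convert` and `parse_csv` helpers are hypothetical, and the conversion is done per value, whereas Spark infers one type per column.

```python
import csv
import io

# Hypothetical two-row sample in the style of the linkage data.
SAMPLE = "id_1,cmp_fname_c1,is_match\n1,0.5,TRUE\n2,?,FALSE\n"

def convert(value, null_value="?"):
    """Map the null marker to None, then try int, float, and boolean in turn."""
    if value == null_value:
        return None  # mimics nullValue="?"
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            pass
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    return value  # fall back to a plain string

def parse_csv(text):
    # DictReader takes column names from the first row, like header="true".
    reader = csv.DictReader(io.StringIO(text))
    return [{name: convert(value) for name, value in row.items()} for row in reader]

rows = parse_csv(SAMPLE)
print(rows)
```

Without the conversion step every value would stay a string, which is what the earlier `prev` DataFrame contains.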

## Analyzing Data with the DataFrame API

## Printing Schema and Displaying Data

Printing the schema and displaying the first five rows of the DataFrame `parsed`.

- **Printing Schema**: `parsed.printSchema()` prints the schema of the DataFrame.
- **Displaying Data**: `parsed.show(5)` displays the first five rows of the DataFrame.

Printing the schema helps in understanding the structure of the DataFrame, while displaying data provides an initial glimpse into its contents.


In [None]:
parsed.printSchema()

parsed.show(5)

## Counting Rows

Calculating the number of rows in the DataFrame `parsed`.

- **Counting Rows**: `parsed.count()` computes the total number of rows in the DataFrame.

Counting rows provides an overview of the dataset's size, helping to gauge its scale and complexity.

In [None]:
parsed.count()

## Caching DataFrame

Marking the DataFrame `parsed` for caching so that later operations can reuse it.

- **Caching Data**: `parsed.cache()` marks the DataFrame for caching. The call is lazy: the data is actually stored the first time an action such as `count()` or `show()` runs on it.

Caching is particularly useful for iterative operations or when you need to reuse the DataFrame multiple times, as it avoids re-reading and re-parsing the CSV file for every action.


In [None]:
parsed.cache()

## Grouping and Aggregating Data

Grouping the DataFrame `parsed` by the column "is_match", counting the rows in each group, and ordering the result by count in descending order.

- **Grouping and Aggregating**: `parsed.groupBy("is_match").count()` groups the data by the "is_match" column and calculates the count of each group.
- **Sorting**: `.orderBy(col("count").desc())` sorts the aggregated data in descending order based on the count.
- **Displaying Results**: `.show()` displays the grouped and aggregated data.

This analysis provides insights into the distribution of matches and non-matches in the dataset, which can be crucial for understanding the data characteristics.


In [None]:
from pyspark.sql.functions import col

parsed.groupBy("is_match").count().orderBy(col("count").desc()).show()
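
For readers new to grouped counts, the same group, count, and sort logic can be sketched in plain Python with `collections.Counter`. The `is_match` list below is hypothetical stand-in data, not values from the linkage file.

```python
from collections import Counter

# Hypothetical is_match values standing in for the DataFrame column.
is_match = [False, False, True, False, True, False]

# Counter groups equal values and counts them;
# most_common() returns (value, count) pairs sorted by count, descending.
counts = Counter(is_match).most_common()
print(counts)
```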

## Creating Temporary View

Creating a temporary view named "linkage" for the DataFrame `parsed`.

- **Creating Temporary View**: `parsed.createOrReplaceTempView("linkage")` creates a temporary view in the Spark SQL context.

Creating a temporary view allows you to query the DataFrame using SQL syntax, enabling more flexible and expressive data analysis.


In [None]:
parsed.createOrReplaceTempView("linkage")

## Execute SQL query to group by 'is_match' and count occurrences, then order by count in descending order

## Executing SQL Query

Executing an SQL query using Spark SQL.

- **SQL Query**: The SQL query selects the "is_match" column and counts the occurrences of each value, grouping by "is_match" and ordering the results by count in descending order.
- **Execution**: `spark.sql(""" ... """)` executes the SQL query using Spark SQL.
- **Displaying Results**: `.show()` displays the results of the SQL query.

Using SQL queries with Spark SQL provides a familiar and powerful interface for data analysis, especially for users comfortable with SQL syntax.

In [None]:
spark.sql("""
  SELECT is_match, COUNT(*) cnt
  FROM linkage
  GROUP BY is_match
  ORDER BY cnt DESC
""").show()

## Fast Summary Statistics, Plotting and Reshaping DataFrames

## Fast Summary Statistics for DataFrames

**Generating a summary statistics DataFrame `summary` by calling `describe()` on the `parsed` DataFrame. It computes basic statistics for each numeric column: count, mean, standard deviation, min, and max. `describe()` does not report quartiles; `parsed.summary()` adds the 25%, 50%, and 75% percentiles.**


In [None]:
summary = parsed.describe()
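
As a rough guide to what each statistic means, the sketch below computes the same five values for one column in plain Python. It assumes, as `describe()` does, that nulls are skipped and that the standard deviation is the sample (n - 1) version; the `describe_column` helper and its input values are hypothetical.

```python
import statistics

def describe_column(values):
    """Count, mean, sample standard deviation, min, and max over non-null values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample (n - 1) standard deviation
        "min": min(present),
        "max": max(present),
    }

# Hypothetical comparison scores with two missing values.
stats = describe_column([1.0, None, 0.5, 1.0, None, 0.0])
print(stats)
```

Note that the count reflects only the non-null values, which is why the counts in `summary` differ from column to column.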

**Selecting the columns 'summary', 'cmp_fname_c1', and 'cmp_fname_c2' from the `summary` DataFrame and displaying the result, which shows the summary statistics for 'cmp_fname_c1' and 'cmp_fname_c2'.**


In [None]:
summary.select("summary", "cmp_fname_c1", "cmp_fname_c2").show()

**Filtering the `parsed` DataFrame into two separate DataFrames: `matches`, which contains rows where the 'is_match' column is true, and `misses`, which contains rows where the 'is_match' column is false. Then, it computes summary statistics for both DataFrames (`match_summary` and `miss_summary`) using the `describe()` function.**


In [None]:
matches = parsed.where("is_match = true")
match_summary = matches.describe()

misses = parsed.filter(col("is_match") == False)
miss_summary = misses.describe()

## Pivoting and Reshaping DataFrames

**Converting the Spark DataFrame `summary` to a Pandas DataFrame named `summary_p`.**


In [None]:
summary_p = summary.toPandas()

**Displaying the first few rows of the Pandas DataFrame `summary_p` using the `head()` method and outputs the shape of the DataFrame using the `shape` attribute.**


In [None]:
summary_p.head()

summary_p.shape

**Transforming the Pandas DataFrame `summary_p` by setting the 'summary' column as the index, transposing the DataFrame, resetting the index, and renaming the columns appropriately. Finally, it removes the axis name to clean up the DataFrame. The code then outputs the shape of the transformed DataFrame.**


In [None]:
summary_p = summary_p.set_index('summary').transpose().reset_index()

summary_p = summary_p.rename(columns={'index':'field'})

summary_p = summary_p.rename_axis(None, axis=1)

summary_p.shape
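
The reshaping step turns "one row per statistic" into "one row per field". A minimal sketch of that transposition in plain Python, using a hypothetical two-statistic summary and a hypothetical `transpose_summary` helper, may make the target layout easier to picture:

```python
# Hypothetical describe()-style output: one row per statistic, one column per field.
# Values are strings, as they are in the summary DataFrame.
summary_rows = [
    {"summary": "count", "cmp_plz": "10", "cmp_by": "12"},
    {"summary": "mean", "cmp_plz": "0.2", "cmp_by": "0.5"},
]

def transpose_summary(rows):
    """Return one row per field, with one key per statistic."""
    fields = [name for name in rows[0] if name != "summary"]
    return [
        {"field": field, **{row["summary"]: row[field] for row in rows}}
        for field in fields
    ]

transposed = transpose_summary(summary_rows)
for row in transposed:
    print(row)
```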

**Creating a Spark DataFrame named `summaryT` from the Pandas DataFrame `summary_p`.**

In [None]:
summaryT = spark.createDataFrame(summary_p)

summaryT

**Printing the schema of the Spark DataFrame `summaryT`, displaying the data types and nullable properties of each column.**


In [None]:
summaryT.printSchema()

**Iterating through the columns of the Spark DataFrame `summaryT` and converting every column except 'field' from string to DoubleType using the `cast()` method. Afterwards, it prints the schema of the updated DataFrame to reflect the changes.**

In [None]:
from pyspark.sql.types import DoubleType
for c in summaryT.columns:
  if c == 'field':
    continue
  summaryT = summaryT.withColumn(c, summaryT[c].cast(DoubleType()))

summaryT.printSchema()

**Defining a function `pivot_summary(desc)` to pivot summary statistics DataFrame. It converts the input DataFrame `desc` to a Pandas DataFrame, transposes it, resets the index, renames columns, and converts it back to a Spark DataFrame. Then, it converts metric columns to DoubleType from String in the Spark DataFrame. Finally, it returns the transformed Spark DataFrame.**


In [None]:
from pyspark.sql import DataFrame
from pyspark.sql.types import DoubleType

def pivot_summary(desc):
  # convert to pandas dataframe
  desc_p = desc.toPandas()
  # transpose
  desc_p = desc_p.set_index('summary').transpose().reset_index()
  desc_p = desc_p.rename(columns={'index':'field'})
  desc_p = desc_p.rename_axis(None, axis=1)
  # convert to Spark dataframe
  descT = spark.createDataFrame(desc_p)
  # convert metric columns to double from string
  for c in descT.columns:
    if c == 'field':
      continue
    else:
      descT = descT.withColumn(c, descT[c].cast(DoubleType()))
  # return only after the loop has cast every metric column
  return descT

**This code applies the `pivot_summary()` function to the `match_summary` and `miss_summary` DataFrames, resulting in `match_summaryT` and `miss_summaryT` Spark DataFrames with pivoted summary statistics.**


In [None]:
match_summaryT = pivot_summary(match_summary)
miss_summaryT = pivot_summary(miss_summary)

## Joining DataFrames and Selecting Features

**This code creates temporary views "match_desc" and "miss_desc" for the Spark DataFrames `match_summaryT` and `miss_summaryT`, respectively. Then, it executes a SQL query to join these views on the 'field' column and selects fields that are not "id_1" or "id_2". It calculates the total count for each field by adding the counts from both match and miss dataframes and computes the difference in mean between match and miss dataframes. Finally, it orders the result by delta (difference in mean) in descending order, and then by total count in descending order.**


In [None]:
match_summaryT.createOrReplaceTempView("match_desc")
miss_summaryT.createOrReplaceTempView("miss_desc")
spark.sql("""
  SELECT a.field, a.count + b.count total, a.mean - b.mean delta
  FROM match_desc a INNER JOIN miss_desc b ON a.field = b.field
  WHERE a.field NOT IN ('id_1', 'id_2')
  ORDER BY delta DESC, total DESC
""").show()

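The query above ranks fields by how far their mean differs between matches and non-matches (`delta`) and by how often they are populated (`total`). A plain-Python sketch of the same join, filter, and ordering follows. The statistics in `match_desc` and `miss_desc` are invented for illustration, and `rank_fields` is a hypothetical helper.

```python
# Hypothetical per-field statistics for matches and non-matches.
match_desc = {
    "id_1": {"count": 100.0, "mean": 5000.0},
    "cmp_plz": {"count": 100.0, "mean": 1.0},
    "cmp_fname_c2": {"count": 10.0, "mean": 0.875},
}
miss_desc = {
    "id_1": {"count": 900.0, "mean": 4000.0},
    "cmp_plz": {"count": 900.0, "mean": 0.25},
    "cmp_fname_c2": {"count": 20.0, "mean": 0.375},
}

def rank_fields(match_desc, miss_desc, skip=("id_1", "id_2")):
    """Inner-join on field name, then sort by delta and total, both descending."""
    ranked = []
    for field in match_desc.keys() & miss_desc.keys():  # inner join on field
        if field in skip:
            continue
        total = match_desc[field]["count"] + miss_desc[field]["count"]
        delta = match_desc[field]["mean"] - miss_desc[field]["mean"]
        ranked.append((field, total, delta))
    ranked.sort(key=lambda r: (-r[2], -r[1]))
    return ranked

ranking = rank_fields(match_desc, miss_desc)
print(ranking)
```

A good feature has both a large `delta` and a large `total`: a field that separates the classes well but is usually missing contributes little to a score.
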
## Scoring and Model Evaluation

**This code creates a string `sum_expression` by joining the elements of the list `good_features` with the '+' operator. Each element in the list is a feature column name, and the resulting string, `cmp_lname_c1 + cmp_plz + cmp_by + cmp_bd + cmp_bm`, is a SQL expression that sums these features.**


In [None]:
good_features = ["cmp_lname_c1", "cmp_plz", "cmp_by", "cmp_bd", "cmp_bm"]

sum_expression = " + ".join(good_features)

sum_expression

**This code fills null values in the columns specified in the `good_features` list with 0, computes the score by summing up the values of these columns, and selects the 'score' and 'is_match' columns. Finally, it displays the resulting DataFrame `scored`.**


In [None]:
from pyspark.sql.functions import expr
scored = parsed.fillna(0, subset=good_features).withColumn('score', expr(sum_expression)).select('score', 'is_match')

scored.show()
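
The `fillna(0, ...)` step matters because in Spark SQL any sum that includes a null is itself null, so one missing feature would otherwise wipe out the whole score. A minimal plain-Python sketch of the scoring rule, using a hypothetical record and a hypothetical `score` helper:

```python
good_features = ["cmp_lname_c1", "cmp_plz", "cmp_by", "cmp_bd", "cmp_bm"]

def score(record, features):
    """Sum the feature values, counting a missing (None) value as 0."""
    return sum(0 if record.get(f) is None else record[f] for f in features)

# Hypothetical record pair comparison with one missing feature.
record = {"cmp_lname_c1": 1.0, "cmp_plz": None, "cmp_by": 1, "cmp_bd": 1, "cmp_bm": 0}
record_score = score(record, good_features)
print(record_score)
```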

**This code defines a function `crossTabs(scored: DataFrame, t: float) -> DataFrame` to generate a cross-tabulation DataFrame for a given threshold `t`. It labels every record with a boolean column `above` (`score >= t`), groups the records by `above`, pivots the 'is_match' column into 'true' and 'false' columns, and counts the occurrences of each combination. Finally, it returns the resulting cross-tabulation DataFrame.**


In [None]:
def crossTabs(scored: DataFrame, t: float) -> DataFrame:
  # label each row as above/below the threshold t, then count matches and non-matches per label
  return scored.selectExpr(f"score >= {t} as above", "is_match").groupBy("above").pivot("is_match", ["true", "false"]).count()
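
The cross-tabulation is a 2x2 table of prediction (`above`) against ground truth (`is_match`). The plain-Python sketch below counts the same four cells with `collections.Counter`. The `cross_tabs` helper and the scored pairs are hypothetical.

```python
from collections import Counter

def cross_tabs(scored, t):
    """Count (above, is_match) pairs, where above means score >= t."""
    return Counter((score >= t, is_match) for score, is_match in scored)

# Hypothetical (score, is_match) pairs.
scored_pairs = [(5.0, True), (4.0, True), (4.5, False),
                (1.0, False), (2.0, False), (3.0, True)]

table = cross_tabs(scored_pairs, 4.0)
for above in (True, False):
    print(above, table[(above, True)], table[(above, False)])
```

Lowering the threshold moves records from the `above = false` row to the `above = true` row, trading missed matches for false alarms.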

**This code generates a cross-tabulation DataFrame by calling the `crossTabs()` function with the DataFrame `scored` and a threshold value of 4.0. It then displays the resulting DataFrame.**


In [None]:
crossTabs(scored, 4.0).show()

**This code generates a cross-tabulation DataFrame by calling the `crossTabs()` function with the DataFrame `scored` and a threshold value of 2.0. It then displays the resulting DataFrame.**


In [None]:
crossTabs(scored, 2.0).show()

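**Calculate precision, recall, and F1-score from the cross-tabulation counts. The `above` column holds the prediction and the 'true'/'false' columns hold the actual `is_match` label, so true positives (TP) are in row `above = true`, column 'true'; false positives (FP) in row `above = true`, column 'false'; false negatives (FN) in row `above = false`, column 'true'; and true negatives (TN) in row `above = false`, column 'false'.**


In [None]:
# cm1 is not defined earlier in the notebook; it is assumed here to be the cross-tabulation at threshold 4.0
cm1 = crossTabs(scored, 4.0)

TP = cm1.filter("above == true").select("true").collect()[0]["true"]
FP = cm1.filter("above == true").select("false").collect()[0]["false"]
FN = cm1.filter("above == false").select("true").collect()[0]["true"]
TN = cm1.filter("above == false").select("false").collect()[0]["false"]

precision = TP/(TP + FP)
recall = TP/(TP + FN)
f1score = 2*precision*recall/(precision+recall)

print(f"Precision->{precision}\nRecall->{recall}\nF1-Score->{f1score}")

In the recorded run, the ratio TP / (TP + FN), which is the recall, came to 0.99713330148.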
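
For reference, the three metrics can be written as a small standalone function. This is a minimal sketch with made-up counts, not results from the linkage data; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted matches that are real matches
    recall = tp / (tp + fn)     # share of real matches that were predicted
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=8)
print(f"Precision->{precision}\nRecall->{recall}\nF1-Score->{f1}")
```

True negatives do not appear in any of the three formulas, which is why these metrics stay informative even when non-matches vastly outnumber matches.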