# Blood Transfusion EDA Project

<p>From replacing blood lost during major surgery or a serious injury to treating various diseases and blood disorders, blood transfusions save lives. Ensuring that enough blood is available when it is needed is a serious challenge for health professionals. According to <a href="https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1">WebMD</a>, about 5 million Americans need a blood transfusion each year.</p>

<p>The dataset is from a mobile blood donation vehicle in Taiwan.</p>

# **Content**

1. [Introduction](#1.)
1. [Importing Libraries](#2.)
1. [Reading the Dataset](#3.)
1. [Exploratory Data Analysis](#4.)

<a id="1."></a> 
# Introduction

RFM analysis is widely used for customer segmentation. It describes each customer by how recently they purchased (Recency), how often they purchase (Frequency), and how much money they spend (Monetary).

These attributes are often adapted for customer lifetime value modeling, churn prediction, customer segmentation, etc.

Here, however, the same attributes are applied to blood donation, which is a social good rather than a commercial activity.

In this dataset:

**RFMTC Components**

1. **Recency (R) — "Recency (months)"**
    - This characteristic represents how long it has been since a donor's last donation. Generally, donors whose last donation was more recent are more likely to donate again.
  
2. **Frequency (F) — "Frequency (times)"**
    - This shows how often a donor donates blood. People who donate more frequently are generally more likely to donate in the future.
  
3. **Monetary (M) — "Monetary (c.c. blood)"**
    - This attribute represents how much blood the donor has donated in total. Generally, donors who donate a higher amount of blood are considered to have a higher value.

4. **Time (T) — "Time (months)"**
    - This shows how long it has been since a donor's first donation. This can be used to understand how "loyal" the donor has been during the donation period.

5. **Churn (C) — "whether he/she donated blood in March 2007"**
    - This binary flag shows whether a donor donated in a specific period (March 2007): 1 means the donor donated, and 0 means the donor did not. In churn terms, a 0 marks a donor who did not return in that period.

**Uses of RFMTC**

1. **Segmentation**: Donors can be categorized into different segments using these attributes. For example, donors with high "F" and low "R" values can be labeled as "Loyal Donors".

2. **Forecasting**: The likelihood of a future donation can be predicted using current RFMTC values.

3. **Targeting**: Special campaigns or incentives can be used to target specific donor segments.

4. **Risk Analysis**: Donors with low frequency and high churn rates can be labeled as "Risky", and specific strategies can be developed for these donors.
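
The segmentation and risk ideas above can be sketched with a few boolean rules. This is only an illustration: the donors are made up, the cut-offs `RECENT_MAX_MONTHS` and `FREQUENT_MIN_TIMES` are hypothetical, and a long recency is used as a stand-in for churn risk, which is an assumption rather than something the dataset defines.

```python
import numpy as np
import pandas as pd

# Made-up donors for illustration only; these are not rows from the dataset.
donors = pd.DataFrame({
    "Recency": [2, 30, 4, 40],    # months since the last donation
    "Frequency": [12, 1, 2, 9],   # total number of donations
})

# Hypothetical cut-offs, chosen only for this example.
RECENT_MAX_MONTHS = 6
FREQUENT_MIN_TIMES = 5

recent = donors["Recency"] <= RECENT_MAX_MONTHS
frequent = donors["Frequency"] >= FREQUENT_MIN_TIMES

# The first matching condition wins; everything else falls into "Other".
donors["Segment"] = np.select(
    [recent & frequent, ~recent & ~frequent],
    ["Loyal Donor", "Risky"],
    default="Other",
)
print(donors)
```

In practice the cut-offs would be chosen from the distribution of the real columns rather than fixed in advance.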

This modeling technique is useful for understanding donors' future behavior, in particular the likelihood that they will donate again, and for managing donor relationships more effectively.

<a id="2."></a> 
# Importing Libraries

In [4]:
import numpy as np
import pandas as pd

<a id="3."></a> 
# Reading the Dataset

In [5]:
df = pd.read_csv("transfusion.data")

FileNotFoundError: [Errno 2] No such file or directory: 'transfusion.data'

<a id="4."></a> 
# Exploratory Data Analysis

## Change the column names if necessary

In [None]:
new_column_names = {
    'Recency (months)': 'Recency',
    'Frequency (times)': 'Frequency',
    'Monetary (c.c. blood)': 'Monetary',
    'Time (months)': 'Time',
    'whether he/she donated blood in March 2007': 'Target'
                   }
            
df.rename(columns=new_column_names, inplace=True)
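
By default `rename` silently ignores mapping keys that do not match any column, so a typo in a long header would only show up later as a missing attribute. A minimal sketch of a stricter variant, using an empty toy frame with two of the headers (the frame `raw` and the misspelled key are made up for the example):

```python
import pandas as pd

# Toy frame with two of the original headers; no real data is needed here.
raw = pd.DataFrame(columns=["Recency (months)", "Frequency (times)"])

mapping = {"Recency (months)": "Recency", "Frequency (times)": "Frequency"}

# errors="raise" makes rename fail loudly if a key is not an existing column.
renamed = raw.rename(columns=mapping, errors="raise")
print(list(renamed.columns))

# A misspelled key (hypothetical typo) now raises KeyError instead of being skipped.
try:
    raw.rename(columns={"Recency (month)": "Recency"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```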

## Get the first 5 rows

In [None]:
df.head(5)

## Look at the general information

In [None]:
df.info()

## Look at the shape

In [None]:
df.shape

## Check for missing values

In [None]:
df.isnull()

In [None]:
df.isnull().sum()

## Check for duplicated values

In [None]:
df[df.duplicated()]

In [None]:
df.duplicated().sum()
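
Note that `duplicated()` flags only the later copies of a repeated row. A small sketch on made-up rows shows the difference from `keep=False`, which flags every member of a duplicate group. In a dataset like this one, identical rows may simply be different donors with the same history, so whether to drop them is a judgment call.

```python
import pandas as pd

# Made-up rows: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({"Recency": [2, 2, 4, 2], "Frequency": [1, 1, 3, 1]})

# Default keep="first": the first occurrence is not counted as a duplicate.
later_copies = toy.duplicated().sum()

# keep=False: every row that has at least one twin is counted.
all_copies = toy.duplicated(keep=False).sum()

print(later_copies, all_copies)
```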

## Check the dtype

In [None]:
df.dtypes

## Calculate the basic statistical values

In [None]:
df.describe()

In [None]:
df.describe().T

## Check unique values

In [None]:
df.Recency.unique()

In [None]:
df.Recency.nunique()

In [None]:
df.Target.unique()

In [None]:
df.Target.nunique()

## Calculate the average of 'Recency'

In [None]:
df.Recency.mean()

## Find the highest value in 'Frequency'

In [None]:
df.Frequency.max()

## Calculate the median of 'Time'

In [None]:
df.Time.median()

## Calculate the standard deviation of 'Monetary'

In [None]:
df.Monetary.std()

## Count the number of unique values in 'Time'

In [None]:
df.Time.unique()

In [None]:
df.Time.nunique()

## Calculate the ratio of donors in March 2007 (Target=1) to total donors

In [None]:
df[df["Target"] == 1].count().iloc[0]

In [None]:
df[df["Target"] == 1].count().iloc[0] / len(df)

In [None]:
df.Target.value_counts()

In [None]:
df.Target.value_counts(normalize=True)
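
Because the target is coded as 0 and 1, the share of donors with `Target == 1` is also just the mean of the column. A short check on a made-up series:

```python
import pandas as pd

# Made-up 0/1 flags for illustration; 3 of the 8 values are 1.
target = pd.Series([1, 0, 0, 1, 0, 0, 0, 1], name="Target")

ratio_from_count = (target == 1).sum() / len(target)
ratio_from_mean = target.mean()

print(ratio_from_count, ratio_from_mean)
```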

## Filter donors with 'Recency' less than 10 months

In [None]:
df[df["Recency"] < 10]

In [None]:
len(df[df["Recency"] < 10])

## Select donors who donated at least 5 times

In [None]:
df[df["Frequency"] >= 5]

In [None]:
len(df[df["Frequency"] >= 5])

## Create a new column giving the time between the first donation and the last donation

In [None]:
df["Donation_Period"] = df.Time - df.Recency
df["Donation_Period"]

In [None]:
df

## Outlier Analysis for 'Frequency'

In [None]:
df.describe().T

In [None]:
q1 = df.Frequency.quantile(0.25)
q1

In [None]:
q3 = df.Frequency.quantile(0.75)
q3

In [None]:
iqr = q3 - q1
iqr

In [None]:
df[df["Frequency"] > 22]  # 22 equals q3 + 3 * iqr if q1 = 2 and q3 = 7

In [None]:
len(df[df["Frequency"] > 22])

In [None]:
len(df[df["Frequency"] > 14.5])  # 14.5 equals q3 + 1.5 * iqr if q1 = 2 and q3 = 7
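
The thresholds 14.5 and 22 used above are consistent with the usual upper fences `q3 + 1.5 * iqr` and `q3 + 3 * iqr` when `q1` is 2 and `q3` is 7; the cell outputs are not shown, so that reading is an assumption. A sketch that derives the fences from the data instead of hard-coding them, using a hypothetical helper `iqr_upper_fences` and made-up values:

```python
import pandas as pd

def iqr_upper_fences(values):
    """Return the mild (1.5 * IQR) and extreme (3 * IQR) upper fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q3 + 1.5 * iqr, q3 + 3.0 * iqr

# Made-up frequencies chosen so that q1 = 2 and q3 = 7.
freq = pd.Series([1, 2, 2, 2, 4, 4, 7, 7, 8, 50])

mild_fence, extreme_fence = iqr_upper_fences(freq)
n_outliers = (freq > mild_fence).sum()

print(mild_fence, extreme_fence, n_outliers)
```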

## Create a simple scoring model based on 'Recency' and 'Frequency'

In [None]:
df.head()

In [None]:
# Note: 1 / Recency is infinite for donors with Recency == 0
df["Donation_Score"] = (1 / df.Recency) + df.Frequency
df["Donation_Score"]

In [None]:
# Fall back to Frequency alone when Recency is 0 to avoid an infinite score
df["Donation_Score_1"] = np.where(df["Recency"] == 0, df.Frequency, (1 / df.Recency) + df.Frequency)
df["Donation_Score_1"]
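
The `np.where` version still evaluates `1 / df.Recency` for every row before picking a branch, so the infinite values are computed and then discarded. One alternative, sketched here on made-up rows, masks the zeros before dividing; the column name `Donation_Score` and the toy frame are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows; the first donor has Recency == 0.
toy = pd.DataFrame({"Recency": [0, 2, 4], "Frequency": [3, 5, 1]})

# Plain division produces infinity for the zero.
naive_inverse = 1 / toy["Recency"]

# Replace 0 with NaN before dividing, then treat the missing reciprocal as 0,
# which gives the same fallback as above: the score is Frequency alone.
inverse_recency = (1 / toy["Recency"].replace(0, np.nan)).fillna(0)
toy["Donation_Score"] = inverse_recency + toy["Frequency"]

print(toy)
```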

In [None]:
df.head()

## Convert Time to Years and Months (Time Series Transformation)

In [None]:
df.Time // 12

In [None]:
df["Years"] = df.Time // 12
df["Years"]

In [None]:
df.Time % 12

In [None]:
df["Months"] = df.Time % 12
df["Months"]
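
Floor division and modulo can also be obtained in one step with Python's built-in `divmod`, which pandas Series support. A small sketch on made-up month counts:

```python
import pandas as pd

# Made-up durations in months.
time_months = pd.Series([2, 12, 35, 98], name="Time")

# divmod returns (whole years, remaining months) as two Series.
years, months = divmod(time_months, 12)

print(years.tolist(), months.tolist())
```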

In [None]:
df.head()

## Calculate the correlation of 'Target' with other features (Correlation Analysis)

In [None]:
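# numeric_only=True skips non-numeric columns, such as the group labels created later
df.corr(numeric_only=True)["Target"]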

In [None]:
df.corr(numeric_only=True)["Target"].sort_values(ascending=False)
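
If the notebook is re-run after the categorical group columns exist, a plain `df.corr()` can fail in recent pandas versions because it no longer drops non-numeric columns silently. A sketch on made-up data that restricts the calculation to numeric columns and drops the trivial self-correlation:

```python
import pandas as pd

# Made-up rows, including one text column that cannot be correlated.
toy = pd.DataFrame({
    "Recency": [1, 2, 3, 4],
    "Frequency": [8, 6, 4, 2],
    "Target": [1, 1, 0, 0],
    "Frequency_Group": ["High", "Medium", "Low", "Low"],
})

corr_with_target = (
    toy.corr(numeric_only=True)["Target"]
    .drop("Target")                 # correlation of Target with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_target)
```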

In [None]:
df.head()

## Create donor groups based on 'Frequency' (Grouping and Aggregation)

In [None]:
df.describe().T

In [None]:
bins = [0, 4, 14, 50]
group_names = ["Low", "Medium", "High"]

df["Frequency_Group"] = pd.cut(df.Frequency, bins, labels=group_names)
df["Frequency_Group"]

In [None]:
df.sample(10)
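
Once the groups exist, the aggregation step could look like the sketch below, which counts donors and computes the share with `Target == 1` per group. The rows are made up; only the bin edges and labels are taken from the cell above.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Frequency": [1, 3, 6, 10, 20, 30],
    "Target": [0, 0, 1, 0, 1, 1],
})
toy["Frequency_Group"] = pd.cut(
    toy["Frequency"], bins=[0, 4, 14, 50], labels=["Low", "Medium", "High"]
)

# observed=False keeps every category in the result, even an empty one.
summary = toy.groupby("Frequency_Group", observed=False)["Target"].agg(["count", "mean"])
print(summary)
```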

## Create a new categorical variable based on 'Recency'

In [None]:
df.describe().T

In [None]:
bins = [-1, 12, 24, 36, 75]
group_names = ["0-12 Months", "13-24 Months", "25-36 Months", "37-74 Months"]

df["Recency_Group"] = pd.cut(df.Recency, bins, labels=group_names)
df["Recency_Group"]
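
The lower edge of -1 matters because `pd.cut` builds right-closed intervals by default, so a first edge of 0 would leave donors with a recency of 0 unassigned. A sketch on made-up values that shows the effect, and `include_lowest=True` as an alternative to the -1 edge:

```python
import pandas as pd

# Made-up recency values, including a 0.
recency = pd.Series([0, 12, 13, 74])
edges = [0, 12, 24, 36, 75]

# Default intervals are (0, 12], (12, 24], ...; the value 0 falls outside and becomes NaN.
default_groups = pd.cut(recency, bins=edges)

# include_lowest=True closes the first interval on the left, so 0 is kept.
closed_groups = pd.cut(recency, bins=edges, include_lowest=True)

print(default_groups.isna().tolist())
print(closed_groups.cat.codes.tolist())
```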

In [None]:
df.head()

## Check the distribution of the 'Target' variable

In [None]:
df.Target.value_counts(normalize=True)

In [None]:
df.Frequency_Group.value_counts()

In [None]:
df.Frequency_Group.value_counts(normalize=True)

In [None]:
df.Frequency_Group.unique()

In [None]:
df.Frequency_Group.nunique()