![](https://www.techproeducation.com/logo/headerlogo.svg)

![image.png](attachment:image.png)

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

# INTRODUCTION

<p>Blood transfusion saves lives, from replacing blood lost during major surgery or a serious injury to treating various illnesses and blood disorders. Ensuring that enough blood is available when it is needed is a serious challenge for health professionals. According to <a href="https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1">WebMD</a>, "about 5 million Americans need a blood transfusion every year".</p>

<p>Our dataset comes from a mobile blood donation vehicle in Taiwan.</p>

<p>The data is stored in datasets/transfusion.data and is structured according to the RFMTC marketing model (a variation of RFM).</p>

RFM is commonly used for customer segmentation. It analyzes customers by how recently they purchased (Recency), how often they purchase (Frequency), and how much money they spend (Monetary).

These features are typically applied to problems such as customer lifetime value modeling, churn prediction, and customer segmentation.

Here, however, they are applied to a social good: blood donation.

In this dataset:

**RFMTC Components**

1. **Recency (R): "Recency (months)"**
    - The time elapsed since a donor's last donation. Donors whose last donation is more recent are generally more likely to donate again.

2. **Frequency (F): "Frequency (times)"**
    - How many times a donor has given blood in total. Donors who have donated often are generally more likely to donate again in the future.

3. **Monetary (M): "Monetary (c.c. blood)"**
    - The total amount of blood the donor has given, in c.c. Donors who have given more blood are generally considered more valuable.

4. **Time (T): "Time (months)"**
    - The time elapsed since a donor's first donation. It can be used to gauge how "loyal" the donor has been over their donation history.

5. **Churn (C): "whether he/she donated blood in March 2007"**
    - Whether a donor gave blood in a specific period (March 2007): 1 means the donor donated and 0 means the donor did not. Churn in this example refers to a donor not donating in that period, so a churned donor has the value 0.

**Uses of RFMTC**

1. **Segmentation**: Donors can be split into segments using these features. For example, donors with a high "F" and a low "R" can be labeled "Loyal Donors".

2. **Prediction**: The probability of a future donation can be estimated from the current RFMTC values.

3. **Targeting**: Special campaigns or incentives can be aimed at specific donor segments.

4. **Risk Analysis**: Donors with low frequency and a high churn rate can be labeled "At Risk", and dedicated strategies can be developed for them.

This modeling technique is useful for understanding donors' future behavior and managing them more effectively. In particular, it can be used to model the probability that a donor will give blood in the future.
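
The segmentation idea above can be sketched in a few lines of pandas. This is only an illustration: the four rows are copied from the dataset preview, the shortened column names are assumed, and the two thresholds are hypothetical values rather than part of the RFMTC model.

```python
import pandas as pd

# Four rows copied from the dataset preview (column names shortened for readability).
donors = pd.DataFrame({
    "Recency": [2, 0, 23, 39],
    "Frequency": [50, 13, 2, 1],
})

# Hypothetical thresholds, chosen only for illustration.
MAX_RECENCY = 3     # months since the last donation
MIN_FREQUENCY = 10  # total number of donations

# A donor is "Loyal" when the last donation is recent AND donations are frequent.
is_loyal = (donors["Recency"] <= MAX_RECENCY) & (donors["Frequency"] >= MIN_FREQUENCY)
donors["Segment"] = is_loyal.map({True: "Loyal", False: "Other"})
print(donors)
```

In practice the cut-offs would be chosen from the data (for example from quantiles) rather than fixed by hand.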

# Importing Libraries

In [18]:
import numpy as np
import pandas as pd

# Exploratory Data Analysis and Visualization

In [19]:
df = pd.read_csv('transfusion.data')
df

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
743,23,2,500,38,0
744,21,2,500,52,0
745,23,3,750,62,0
746,39,1,250,39,0


## Change the column names if necessary

In [21]:
new_column_names = {
    'Recency (months)': 'Recency',
    'Frequency (times)': 'Frequency',
    'Monetary (c.c. blood)': 'Monetary',
    'Time (months)': 'Time',
    'whether he/she donated blood in March 2007': 'Target'
}

df.rename(columns=new_column_names, inplace=True)

In [22]:
df

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
743,23,2,500,38,0
744,21,2,500,52,0
745,23,3,750,62,0
746,39,1,250,39,0


## Get the first 5 lines

In [23]:
df.head(5)

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


## Look at the general information

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Recency    748 non-null    int64
 1   Frequency  748 non-null    int64
 2   Monetary   748 non-null    int64
 3   Time       748 non-null    int64
 4   Target     748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB


## Look at the shape

In [25]:
np.shape(df)

(748, 5)

In [31]:
df.shape

(748, 5)

## Check for missing values

In [26]:
df.isnull()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
743,False,False,False,False,False
744,False,False,False,False,False
745,False,False,False,False,False
746,False,False,False,False,False


In [33]:
df.isna().sum()

Recency      0
Frequency    0
Monetary     0
Time         0
Target       0
dtype: int64

In [34]:
df.isna().sum().any()

False

In [27]:
df.notnull()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,True,True,True,True,True
1,True,True,True,True,True
2,True,True,True,True,True
3,True,True,True,True,True
4,True,True,True,True,True
...,...,...,...,...,...
743,True,True,True,True,True
744,True,True,True,True,True
745,True,True,True,True,True
746,True,True,True,True,True


In [35]:
df.notnull().sum()

Recency      748
Frequency    748
Monetary     748
Time         748
Target       748
dtype: int64

## Check for duplicated values

In [37]:
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
743    False
744    False
745    False
746    False
747    False
Length: 748, dtype: bool

In [38]:
df.duplicated().sum()

215
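
Before treating these rows as errors, it may help to look at them. The dataset has no donor identifier, so two identical rows could simply be two different donors with the same values. The sketch below uses a made-up toy frame to show the difference between the default `keep="first"` and `keep=False`.

```python
import pandas as pd

# Toy data (not from the dataset): rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "Recency": [2, 2, 4, 2],
    "Frequency": [1, 1, 3, 1],
})

# keep="first" (the default) flags only the repeats, not the first occurrence.
n_repeats = toy.duplicated().sum()

# keep=False flags every member of a duplicated group, which helps inspection.
all_copies = toy[toy.duplicated(keep=False)]

print(n_repeats)
print(all_copies)
```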

## Check the dtype

In [28]:
df.dtypes

Recency      int64
Frequency    int64
Monetary     int64
Time         int64
Target       int64
dtype: object

## Calculate the basic statistical values

In [30]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Recency,748.0,9.506684,8.095396,0.0,2.75,7.0,14.0,74.0
Frequency,748.0,5.514706,5.839307,1.0,2.0,4.0,7.0,50.0
Monetary,748.0,1378.676471,1459.826781,250.0,500.0,1000.0,1750.0,12500.0
Time,748.0,34.282086,24.376714,2.0,16.0,28.0,50.0,98.0
Target,748.0,0.237968,0.426124,0.0,0.0,0.0,0.0,1.0


## Check unique values

In [39]:
df.Recency.unique()

array([ 2,  0,  1,  4,  5,  9,  3, 12,  6, 11, 10, 13,  8, 14,  7, 16, 15,
       23, 21, 18, 22, 26, 35, 38, 40, 74, 20, 17, 25, 39, 72],
      dtype=int64)

In [40]:
df.Frequency.unique()

array([50, 13, 16, 20, 24,  4,  7, 12,  9, 46, 23,  3, 10,  6,  5, 14, 15,
       11,  8,  2, 19, 17,  1, 22, 18, 38, 43, 34, 44, 26, 41, 21, 33],
      dtype=int64)

In [42]:
df.Time.unique()

array([98, 28, 35, 45, 77,  4, 14, 22, 58, 47, 15, 11, 48, 49, 16, 40, 34,
       21, 26, 64, 57, 53, 69, 36,  2, 46, 52, 81, 29,  9, 74, 25, 51, 71,
       23, 86, 38, 76, 70, 59, 82, 61, 79, 41, 33, 10, 95, 88, 19, 37, 39,
       78, 42, 27, 24, 63, 43, 75, 73, 50, 60, 17, 72, 62, 30, 31, 65, 89,
       87, 93, 83, 32, 12, 18, 55,  3, 13, 54], dtype=int64)

In [43]:
df.Monetary.unique()

array([12500,  3250,  4000,  5000,  6000,  1000,  1750,  3000,  2250,
       11500,  5750,   750,  2500,  1500,  1250,  3500,  3750,  2750,
        2000,   500,  4750,  4250,   250,  5500,  4500,  9500, 10750,
        8500, 11000,  6500, 10250,  5250,  8250], dtype=int64)

In [41]:
df.Target.unique()

array([1, 0], dtype=int64)

In [44]:
unique_count = df.nunique()
unique_count

Recency      31
Frequency    33
Monetary     33
Time         78
Target        2
dtype: int64

In [45]:
df.Target.nunique()

2

## Calculate the average of 'Recency'

In [46]:
df["Recency"].mean()

9.506684491978609

## Find the highest value in 'Frequency'

In [48]:
df.Frequency.max()

50

## Calculate the median of 'Time'

In [49]:
df.Time.median()   # median

28.0

## Calculate the standard deviation of 'Monetary'

In [50]:
df.Monetary.std()

1459.826780772503

## Count the number of unique values in 'Time'

In [51]:
df.Time.nunique()

78

## Calculate the ratio of donors in March 2007 (Target=1) to total donors

In [54]:
df.Target.value_counts()

0    570
1    178
Name: Target, dtype: int64

In [56]:
df.Target.value_counts()[1] / len(df)

0.23796791443850268

In [52]:
df.Target.sum()/df.Target.count()

0.23796791443850268

In [58]:
df.Target.value_counts(normalize = True)

0    0.762032
1    0.237968
Name: Target, dtype: float64
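
All of the computations above agree because `Target` is coded as 0/1, so its mean is the share of 1s. A minimal check on a made-up series:

```python
import pandas as pd

# Toy 0/1 series (not from the dataset): two donors out of five donated.
target = pd.Series([1, 0, 0, 1, 0], name="Target")

ratio_from_counts = target.value_counts()[1] / len(target)
ratio_from_sum = target.sum() / target.count()
ratio_from_mean = target.mean()  # mean of a 0/1 column is the share of 1s

print(ratio_from_counts, ratio_from_sum, ratio_from_mean)
```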

## Filter donors with 'Recency' less than 10 months

In [59]:
df.Recency.sample(5)

234    14
291    16
720    21
57      2
381    14
Name: Recency, dtype: int64

In [60]:
df["Recency"] < 10

0       True
1       True
2       True
3       True
4       True
       ...  
743    False
744    False
745    False
746    False
747    False
Name: Recency, Length: 748, dtype: bool

In [61]:
df[df["Recency"] < 10]

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
653,4,2,500,30,0
656,4,2,500,31,0
669,2,3,750,75,1
670,2,3,750,77,0


In [62]:
len(df[df["Recency"] < 10])

401

In [63]:
df[df.Recency < 10].count()  # Mr. Cengiz's solution

Recency      401
Frequency    401
Monetary     401
Time         401
Target       401
dtype: int64

In [64]:
df.query("Recency < 10")  # Ms. Elif's solution

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
653,4,2,500,30,0
656,4,2,500,31,0
669,2,3,750,75,1
670,2,3,750,77,0


## Select donors who donated at least 5 times

In [67]:
df[df["Frequency"] >= 5]

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
713,16,6,1500,81,0
715,16,5,1250,71,0
719,23,8,2000,69,0
726,25,6,1500,50,0


## Create a new column giving the time between the first donation and the last donation

In [68]:
df["Donation_period"] = df["Time"] - df["Recency"]
df["Donation_period"] 

0      96
1      28
2      34
3      43
4      76
       ..
743    15
744    31
745    39
746     0
747     0
Name: Donation_period, Length: 748, dtype: int64

## Outlier Analysis for 'Monetary'

In [80]:
from scipy import stats

In [92]:
z_scores = np.abs(stats.zscore(df["Monetary"]))
outliers = np.where(z_scores > 3)[0]

In [93]:
z_scores

0      7.623346
1      1.282738
2      1.796842
3      2.482313
4      3.167784
         ...   
743    0.602307
744    0.602307
745    0.430940
746    0.773675
747    0.773675
Name: Monetary, Length: 748, dtype: float64

In [94]:
outliers

array([  0,   4,   9, 115, 341, 500, 502, 503, 504, 505, 517, 528],
      dtype=int64)

In [95]:
df.iloc[outliers]

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_period
0,2,50,12500,98,1,96
4,1,24,6000,77,0,76
9,5,46,11500,98,1,93
115,11,24,6000,64,0,53
341,23,38,9500,98,0,75
500,2,43,10750,86,1,84
502,2,34,8500,77,1,75
503,2,44,11000,98,0,96
504,0,26,6500,76,1,76
505,2,41,10250,98,1,96
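
For reference, the z-score used above is `(x - mean) / std`, and `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default. The sketch below reproduces that calculation with NumPy only, on made-up values. It is meant to show the mechanics, not to replace the cell above.

```python
import numpy as np

# Toy values in c.c. (not from the dataset): twenty small totals and one very large one.
monetary = np.array([250] * 10 + [500] * 10 + [12500], dtype=float)

# Population standard deviation (ddof=0), matching the scipy.stats.zscore default.
z = (monetary - monetary.mean()) / monetary.std(ddof=0)

# Positions whose absolute z-score exceeds the usual threshold of 3.
outlier_idx = np.where(np.abs(z) > 3)[0]
print(outlier_idx)
```

Note that `Monetary` is strongly right-skewed, so a z-score threshold flags only the high end. Whether those donors are genuine outliers or simply very active donors is a judgment call.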


## Create a simple scoring model based on 'Recency' and 'Frequency'

In [98]:
# Rows with Recency == 0 produce 1 / 0 = inf here; Donor_Score1 below handles that case.
df["Donor_Score"] = (1 / df["Recency"]) + df["Frequency"]
df["Donor_Score"]

0      50.500000
1            inf
2      17.000000
3      20.500000
4      25.000000
         ...    
743     2.043478
744     2.047619
745     3.043478
746     1.025641
747     1.013889
Name: Donor_Score, Length: 748, dtype: float64

In [114]:
df["Donor_Score"].sort_values(ascending=False)

106         inf
504         inf
67          inf
1           inf
11          inf
         ...   
497    1.026316
746    1.025641
498    1.025000
747    1.013889
499    1.013514
Name: Donor_Score, Length: 748, dtype: float64

In [100]:
df["Donor_Score1"] = np.where(df["Recency"] == 0, df["Frequency"], (1 / df["Recency"]) + df["Frequency"])
df["Donor_Score1"] 

0      50.500000
1      13.000000
2      17.000000
3      20.500000
4      25.000000
         ...    
743     2.043478
744     2.047619
745     3.043478
746     1.025641
747     1.013889
Name: Donor_Score1, Length: 748, dtype: float64

In [103]:
df.head(10)

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_period,Donor_Score,Donor_Score1
0,2,50,12500,98,1,96,50.5,50.5
1,0,13,3250,28,1,28,inf,13.0
2,1,16,4000,35,1,34,17.0,17.0
3,2,20,5000,45,1,43,20.5,20.5
4,1,24,6000,77,0,76,25.0,25.0
5,4,4,1000,4,0,0,4.25,4.25
6,2,7,1750,14,1,12,7.5,7.5
7,1,12,3000,35,0,34,13.0,13.0
8,2,9,2250,22,1,20,9.5,9.5
9,5,46,11500,98,1,93,46.2,46.2
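
An alternative way to avoid the `inf` values, sketched under the same scoring rule: replace `Recency == 0` with `NaN` before dividing, then treat the missing recency term as 0. The column name `Donor_Score_safe` is an assumption for this example, and the three rows are copied from the preview above.

```python
import numpy as np
import pandas as pd

# First three rows of the dataset preview.
donors = pd.DataFrame({"Recency": [2, 0, 1], "Frequency": [50, 13, 16]})

# Replace 0 with NaN so 1 / Recency is NaN (not inf), then count the missing term as 0.
recency_term = (1 / donors["Recency"].replace(0, np.nan)).fillna(0)
donors["Donor_Score_safe"] = recency_term + donors["Frequency"]
print(donors)
```

Both this version and `Donor_Score1` give the most recent donors (Recency of 0) no recency bonus at all. A variant such as `1 / (Recency + 1)` would avoid that, at the cost of changing every score.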


## Convert Time to Years and Months (Time Series Transformation)

In [106]:
df["Years"] = df["Time"] // 12  # plain division (df["Time"] / 12) would give fractional years, which are harder to read
df["Years"] 

0      8
1      2
2      2
3      3
4      6
      ..
743    3
744    4
745    5
746    3
747    6
Name: Years, Length: 748, dtype: int64

In [107]:
df["Month"] = df["Time"] % 12

In [109]:
df.head()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_period,Donor_Score,Donor_Score1,Years,Month
0,2,50,12500,98,1,96,50.5,50.5,8,2
1,0,13,3250,28,1,28,inf,13.0,2,4
2,1,16,4000,35,1,34,17.0,17.0,2,11
3,2,20,5000,45,1,43,20.5,20.5,3,9
4,1,24,6000,77,0,76,25.0,25.0,6,5
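
Floor division and modulo can also be computed in one step with the built-in `divmod`, which pandas Series support. A short sketch using the first three `Time` values from the table above:

```python
import pandas as pd

# First three Time values (in months) from the dataset preview.
time_months = pd.Series([98, 28, 35], name="Time")

# divmod returns (quotient, remainder): whole years and the leftover months.
years, months = divmod(time_months, 12)

print(years.tolist())   # [8, 2, 2]
print(months.tolist())  # [2, 4, 11]
```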


## Calculate the correlation of 'Target' with other features (Correlation Analysis)

In [112]:
df.corr()["Target"].sort_values(ascending=False)

Target             1.000000
Donor_Score1       0.225380
Monetary           0.218633
Frequency          0.218633
Donor_Score        0.216745
Donation_period    0.056986
Month             -0.021089
Years             -0.032680
Time              -0.035854
Recency           -0.279869
Name: Target, dtype: float64
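
`Monetary` and `Frequency` have exactly the same correlation with `Target`. In the rows shown earlier, `Monetary` is always 250 times `Frequency`, and if that holds for the whole dataset, the two columns carry the same information. The sketch below checks this on the five preview rows only. It does not prove the relation for all 748 rows.

```python
import pandas as pd

# First five rows of the dataset preview.
sample = pd.DataFrame({
    "Frequency": [50, 13, 16, 20, 24],
    "Monetary": [12500, 3250, 4000, 5000, 6000],
})

# Is Monetary exactly 250 c.c. per donation in every row?
is_multiple = bool((sample["Monetary"] == 250 * sample["Frequency"]).all())

# A perfectly linear relation gives a Pearson correlation of 1.
corr = sample["Frequency"].corr(sample["Monetary"])
print(is_multiple, corr)
```

On the full data, the same check would be `(df["Monetary"] == 250 * df["Frequency"]).all()`.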

## Create donor groups based on 'Frequency' (Grouping and Aggregation)

In [116]:
bins = [0, 5, 10, 50]
groups_names = ["low", "medium", "High"]

df["Frequency_Group"] = pd.cut(df["Frequency"], bins, labels = groups_names)

df.sample(10)

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_period,Donor_Score,Donor_Score1,Years,Month,Frequency_Group
57,2,7,1750,28,1,26,7.5,7.5,2,4,medium
292,11,4,1000,28,0,17,4.090909,4.090909,2,4,low
288,14,5,1250,28,1,14,5.071429,5.071429,2,4,low
444,23,12,3000,86,0,63,12.043478,12.043478,7,2,High
657,14,8,2000,50,0,36,8.071429,8.071429,4,2,medium
450,23,3,750,33,0,10,3.043478,3.043478,2,9,low
296,14,5,1250,28,0,14,5.071429,5.071429,2,4,low
547,2,3,750,11,0,9,3.5,3.5,0,11,low
142,4,3,750,16,0,12,3.25,3.25,1,4,low
108,2,3,750,14,0,12,3.5,3.5,1,2,low


In [118]:
df.groupby(["Frequency_Group"])["Monetary"].mean()

Frequency_Group
low        624.220374
medium    1855.182927
High      4143.203883
Name: Monetary, dtype: float64
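
By default `pd.cut` builds right-closed intervals, so the bins above are (0, 5], (5, 10] and (10, 50]. The sketch below uses made-up boundary values to show which group each one lands in.

```python
import pandas as pd

# Boundary values chosen to show where each bin starts and ends.
freq = pd.Series([1, 5, 6, 10, 11, 50])

groups = pd.cut(freq, bins=[0, 5, 10, 50], labels=["low", "medium", "High"])
print(pd.DataFrame({"Frequency": freq, "Group": groups}))
```

A donor with exactly 5 donations is therefore "low", and one with exactly 10 is "medium".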

## Create a new categorical variable based on 'Recency'

In [129]:
bins = [0, 12, 24, 36, 74]
groups_names = ["0-12 Month", "13-24 Month", "25-36 Month", "37-74 Month"]

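# Note: pd.cut uses right-closed bins by default, so the first bin is (0, 12] and
# rows with Recency == 0 fall outside it and get NaN (see row 1 in the output).
df["Recency_Categorical"] = pd.cut(df["Recency"], bins, labels = groups_names)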

df.head()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_period,Donor_Score,Donor_Score1,Years,Month,Frequency_Group,Recency_Categorical
0,2,50,12500,98,1,96,50.5,50.5,8,2,High,0-12 Month
1,0,13,3250,28,1,28,inf,13.0,2,4,High,
2,1,16,4000,35,1,34,17.0,17.0,2,11,High,0-12 Month
3,2,20,5000,45,1,43,20.5,20.5,3,9,High,0-12 Month
4,1,24,6000,77,0,76,25.0,25.0,6,5,High,0-12 Month


In [130]:
bins = [0, 12, 24, 36, 74]       # the instructor's code
group_names = ["0-12 Month", "13-24 Month", "25-36 Month", "37-74 Month"]
df["Recency_Categorical"] = pd.cut(df["Recency"], bins, labels= group_names)
df.head(10)

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_period,Donor_Score,Donor_Score1,Years,Month,Frequency_Group,Recency_Categorical
0,2,50,12500,98,1,96,50.5,50.5,8,2,High,0-12 Month
1,0,13,3250,28,1,28,inf,13.0,2,4,High,
2,1,16,4000,35,1,34,17.0,17.0,2,11,High,0-12 Month
3,2,20,5000,45,1,43,20.5,20.5,3,9,High,0-12 Month
4,1,24,6000,77,0,76,25.0,25.0,6,5,High,0-12 Month
5,4,4,1000,4,0,0,4.25,4.25,0,4,low,0-12 Month
6,2,7,1750,14,1,12,7.5,7.5,1,2,medium,0-12 Month
7,1,12,3000,35,0,34,13.0,13.0,2,11,High,0-12 Month
8,2,9,2250,22,1,20,9.5,9.5,1,10,medium,0-12 Month
9,5,46,11500,98,1,93,46.2,46.2,8,2,High,0-12 Month
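
In the output above, the row with `Recency` equal to 0 has an empty `Recency_Categorical`. The first bin is (0, 12], which excludes 0. One way to include it is `include_lowest=True`, sketched here on made-up values with the same bins and labels.

```python
import pandas as pd

recency = pd.Series([0, 2, 12, 13, 74])
bins = [0, 12, 24, 36, 74]
labels = ["0-12 Month", "13-24 Month", "25-36 Month", "37-74 Month"]

# Default behaviour: 0 is outside (0, 12] and becomes NaN.
default_cut = pd.cut(recency, bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is kept.
inclusive_cut = pd.cut(recency, bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"Recency": recency, "default": default_cut, "inclusive": inclusive_cut}))
```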


## Check the distribution of the 'Target' variable

In [128]:
df.Target.value_counts(normalize=True).round(3)

0    0.762
1    0.238
Name: Target, dtype: float64

In [127]:
df.Target.value_counts(normalize=True).round(4)

0    0.762
1    0.238
Name: Target, dtype: float64

# BONUS

## Feature Analysis

In [131]:
output_data = []

for col in df.columns:
    
    # If the number of unique values in the column is less than or equal to 5
    if df.loc[:, col].nunique() <= 5:
        # Get the unique values in the column
        unique_values = df.loc[:, col].unique()
        # Append the column name, number of unique values, unique values, and data type to the output data
        output_data.append([col, df.loc[:, col].nunique(), unique_values, df.loc[:, col].dtype])
    else:
        # Otherwise, append only the column name, number of unique values, and data type to the output data
        output_data.append([col, df.loc[:, col].nunique(),"-", df.loc[:, col].dtype])

output_df = pd.DataFrame(output_data, columns=['Column Name', 'Number of Unique Values', 'Unique Values', 'Data Type'])

output_df

Unnamed: 0,Column Name,Number of Unique Values,Unique Values,Data Type
0,Recency,31,-,int64
1,Frequency,33,-,int64
2,Monetary,33,-,int64
3,Time,78,-,int64
4,Target,2,"[1, 0]",int64
5,Donation_period,90,-,int64
6,Donor_Score,184,-,float64
7,Donor_Score1,186,-,float64
8,Years,9,-,int64
9,Month,12,-,int64


## Classify DataFrame Columns into Categorical and Numeric Types

In [132]:
def grab_col_names(dataframe, cat_th=10):

    # cat_cols
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]    
    cat_cols = cat_cols + num_but_cat
    

    # num_cols
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in cat_cols]

    print(f"Features: {dataframe.shape[1]}")
    print(f'Number of Categorical Features: {len(cat_cols)}')
    print(f'Number of Numeric Features: {len(num_cols)}')
    print(f"Categorical Features: {cat_cols}") 
    print(f"Numeric Features: {num_cols}")
    
    return cat_cols, num_cols

In [133]:
cat_cols, num_cols = grab_col_names(df)

Features: 12
Number of Categorical Features: 4
Number of Numeric Features: 8
Categorical Features: ['Target', 'Years', 'Frequency_Group', 'Recency_Categorical']
Numeric Features: ['Recency', 'Frequency', 'Monetary', 'Time', 'Donation_period', 'Donor_Score', 'Donor_Score1', 'Month']


## DataFrame Summary Statistics

In [134]:
def summary(df, pred=None):
    # `pred` is kept in the signature for compatibility but is not used.
    types = df.dtypes
    counts = df.count()
    min_values = df.min()
    max_values = df.max()
    uniques = df.apply(lambda x: x.unique().shape[0])  # NaN counts as a unique value
    nulls = df.isnull().sum()
    print('Data shape:', df.shape)

    # Do not name the result `str`: that would shadow the built-in type.
    result = pd.concat([types, counts, uniques, nulls, min_values, max_values], axis=1, sort=True)
    result.columns = ['Types', 'Counts', 'Uniques', 'Nulls', 'Min', 'Max']

    print('___________________________\nData Types:')
    print(result.Types.value_counts())
    print('___________________________')
    return result

summary(df)

Data shape: (748, 12)
___________________________
Data Types:
int64       8
float64     2
category    1
category    1
Name: Types, dtype: int64
___________________________


Unnamed: 0,Types,Counts,Uniques,Nulls,Min,Max
Donation_period,int64,748,90,0,0,96
Donor_Score,float64,748,184,0,1.013514,inf
Donor_Score1,float64,748,186,0,1.013514,50.5
Frequency,int64,748,33,0,1,50
Frequency_Group,category,748,3,0,low,High
Monetary,int64,748,33,0,250,12500
Month,int64,748,12,0,0,11
Recency,int64,748,31,0,0,74
Recency_Categorical,category,743,5,5,0-12 Month,37-74 Month
Target,int64,748,2,0,0,1


## Check and Remove Duplicate Rows

In [135]:
def duplicate_values(df):
    # Reports duplicated rows and drops them from `df` in place (the caller's DataFrame is modified).
    print("Duplicate check...")
    num_duplicates = df.duplicated(subset=None, keep='first').sum()
    if num_duplicates > 0:
        print("There are", num_duplicates, "duplicated observations in the dataset.")
        df.drop_duplicates(keep='first', inplace=True)
        print(num_duplicates, "duplicates were dropped!")
        print('*' * 100)
    else:
        print("There are no duplicated observations in the dataset.")

In [136]:
duplicate_values(df)

Duplicate check...
There are 215 duplicated observations in the dataset.
215 duplicates were dropped!
****************************************************************************************************


In [137]:
duplicate_values(df)

Duplicate check...
There are no duplicated observations in the dataset.
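
The function above modifies `df` in place, so every later cell sees the reduced data. A non-mutating variant is sketched below. The helper name `drop_duplicate_rows` and the toy data are assumptions for illustration only.

```python
import pandas as pd

def drop_duplicate_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without repeated rows, leaving the input untouched."""
    return frame.drop_duplicates(keep="first").reset_index(drop=True)

# Toy data (not from the dataset): rows 0 and 1 are identical.
toy = pd.DataFrame({"Recency": [2, 2, 4], "Frequency": [1, 1, 3]})
deduped = drop_duplicate_rows(toy)

print(len(toy), len(deduped))
```

Keeping the original frame makes it easy to compare results with and without duplicates, for example the share of `Target == 1`.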


## Missing Value Analysis in DataFrame

In [138]:
def missing_values(df):
    missing_number = df.isnull().sum().sort_values(ascending = False)
    # Despite its name, Missing_Percent is a fraction between 0 and 1, not a percentage.
    missing_percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending = False)
    # Use a local name that does not shadow the function itself.
    result = pd.concat([missing_number, missing_percent], axis = 1, keys = ['Missing_Number', 'Missing_Percent'])
    return result[result['Missing_Number'] > 0]

In [139]:
missing_values(df)

Unnamed: 0,Missing_Number,Missing_Percent
Recency_Categorical,5,0.009381


## Column Value Distribution Analysis

In [140]:
def value_cnt(df, column_name):
    vc = df[column_name].value_counts()
    vc_norm = df[column_name].value_counts(normalize=True).round(3)

    # Label the value column with the name of the analyzed column.
    vc = vc.rename_axis(column_name).reset_index(name='counts')
    vc_norm = vc_norm.rename_axis(column_name).reset_index(name='norm_counts')

    df_result = pd.concat([vc[column_name], vc['counts'], vc_norm['norm_counts']], axis=1)
    
    return df_result

In [141]:
# Target incidence is the number of cases of each Target value in the dataset,
# i.e. how many 1s versus how many 0s the Target column contains.
# It gives an idea of how balanced (or imbalanced) the dataset is.

value_cnt(df, 'Target')

Unnamed: 0,Target,counts,norm_counts
0,0,384,0.72
1,1,149,0.28


# <p style="background-color:green;font-family:newtimeroman;font-size:100%;color:white;text-align:center;border-radius:20px 20px;"><b>Thank You, and We Hope This Was Useful</b></p>