## Visualization

### 1) What percentage of customers in the dataset have experienced attrition?

In [None]:
import matplotlib.pyplot as plt

# Count the occurrences of each churn status
attrition_counts = Bank['Attrition_Flag'].value_counts()

# Define custom colors (purple and dark blue)
colors = ['#800080', '#00008B']  # Purple and dark blue

# Create a pie chart with custom colors
plt.pie(attrition_counts, labels=attrition_counts.index, autopct='%1.1f%%', startangle=90, wedgeprops=dict(width=0.33), colors=colors)

# Add a hole in the center to make it look like a donut chart
centre_circle = plt.Circle((0, 0), 0.25, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Set the title
plt.title('Proportion of Churn vs Not Churn Customers')

# Display the plot
plt.show()

#### Insights:
The percentage of customers who have experienced attrition in the dataset is 16.1%.
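
The same percentage can be read directly from the data instead of from the chart labels. The sketch below is only illustrative: `bank_demo` is a hypothetical six-row stand-in for `Bank`, and the label strings are assumptions about how `Attrition_Flag` is encoded.

```python
import pandas as pd

# Hypothetical stand-in for the Bank DataFrame: 5 existing customers, 1 attrited
bank_demo = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 5 + ["Attrited Customer"],
})

# normalize=True returns fractions of the total instead of raw counts
attrition_share = bank_demo["Attrition_Flag"].value_counts(normalize=True) * 100
print(attrition_share.round(1))
```

Running the same two lines on `Bank['Attrition_Flag']` should reproduce the percentages shown on the donut chart.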

### 2) How does the distribution of 'Avg_Utilization_Ratio' differ between attrited and existing customers based on the box plot?

In [None]:
import plotly.express as px
import pandas as pd

# Assuming Bank is your DataFrame
df = pd.DataFrame(Bank)

# Create a box plot with Plotly Express
fig = px.box(df, x='Attrition_Flag', y='Avg_Utilization_Ratio',
             color='Attrition_Flag',
             category_orders={'Attrition_Flag': ['Existing', 'Attrited']},
             title='Distribution of Avg_Utilization_Ratio for Attrited and Existing Customers',
             labels={'Attrition_Flag': 'Attrition Flag', 'Avg_Utilization_Ratio': 'Avg. Utilization Ratio'})

# Show the plot
fig.show()


#### Insights:
Existing customers show more than one outlier in `Avg_Utilization_Ratio`.
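
To count outliers rather than eyeball them, one option is the common 1.5 × IQR rule that box plots usually draw their whiskers from. This is a minimal sketch on made-up numbers; `count_iqr_outliers` and `demo` are hypothetical names, and Plotly's own quartile calculation may differ slightly from pandas' default interpolation.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Made-up sample data, not taken from the real dataset
demo = pd.DataFrame({
    "Attrition_Flag": ["Existing"] * 6 + ["Attrited"] * 4,
    "Avg_Utilization_Ratio": [0.10, 0.12, 0.11, 0.13, 0.12, 0.95,
                              0.00, 0.05, 0.10, 0.02],
})

# Apply the rule separately to each attrition group
outliers_per_group = (
    demo.groupby("Attrition_Flag")["Avg_Utilization_Ratio"]
        .apply(count_iqr_outliers)
)
print(outliers_per_group)
```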

### 3) Are there any income groups that tend to have higher or lower total revolving balances?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.barplot(x='Income_Category', y='Total_Revolving_Bal', data=Bank, errorbar=None)
plt.title('Average Total Revolving Balance by Income Category')
plt.xlabel('Income Category')
plt.ylabel('Total_Revolving_Bal')
plt.show()

#### Insights:
Customers with an income below 40K have a lower average total revolving balance, while customers with an income above 120K have a higher one.
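
The bar heights come from seaborn's default estimator, the mean, so the same comparison can be checked numerically with a `groupby`. The sketch below uses invented balances and assumed category labels purely to show the shape of the calculation.

```python
import pandas as pd

# Invented values; the category labels are assumptions about the dataset
demo = pd.DataFrame({
    "Income_Category": ["Less than $40K", "Less than $40K", "$120K +", "$120K +"],
    "Total_Revolving_Bal": [1000, 1200, 1300, 1500],
})

# Mean balance per income category, sorted from lowest to highest
mean_balance = (
    demo.groupby("Income_Category")["Total_Revolving_Bal"]
        .mean()
        .sort_values()
)
print(mean_balance)
```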

### 4) How does the number of dependents vary across different marital statuses?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt



plt.figure(figsize=(10, 6))
sns.countplot(x='Marital_Status', hue='Dependent_count', data=Bank, palette='viridis')
plt.title('Dependent Count by Marital Status')
plt.xlabel('Marital Status')
plt.ylabel('Count')
plt.show()


##### Insight:
*The count plot shows how dependents are distributed across marital statuses. Among married and single customers, having 2 or 3 dependents is most common, while divorced customers show a higher count for 4 dependents.*

### 5) Is there a correlation between the type of card a customer holds and their income category?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Share of each card category within every income category (each row sums to 1)
card_share = pd.crosstab(Bank['Income_Category'], Bank['Card_Category'], normalize='index')

# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure first
# Other colormaps to try: 'plasma', 'magma', 'inferno', 'cividis'
card_share.plot(kind='bar', stacked=True, cmap='coolwarm', figsize=(12, 8))
plt.title('Card Categories by Income Category (Stacked Proportions)')
plt.xlabel('Income Category')
plt.ylabel('Proportion')
plt.show()

##### Insights:
*Customers with an income of 120K+ have the highest share of gold card holders, while customers with an income of 80K-120K are the only group with platinum card holders.*
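
The stacked bars are row-normalized proportions, which is what `normalize='index'` does in `pd.crosstab`. A small sketch with invented rows (the labels are assumptions) shows how each income category's shares add up to 1:

```python
import pandas as pd

# Invented rows for illustration only
demo = pd.DataFrame({
    "Income_Category": ["$120K +"] * 4 + ["Less than $40K"] * 2,
    "Card_Category": ["Blue", "Blue", "Blue", "Gold", "Blue", "Blue"],
})

# normalize='index' divides every row by its row total
shares = pd.crosstab(demo["Income_Category"], demo["Card_Category"], normalize="index")
print(shares)
```

Reading the table row by row gives the share of each card type within one income group, which is easier to compare across groups than raw counts.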

In [None]:
# Keep only the first 50 rows; every later cell that uses Bank works on this sample
Bank = Bank.head(50)

### 6) What is the relationship between Credit_Limit and Avg_Open_To_Buy?

In [None]:
Bank.plot(x= 'Credit_Limit', y= 'Avg_Open_To_Buy' , kind='scatter', color = 'purple');

#### From the above:
*There is a strong positive relationship between Credit_Limit and Avg_Open_To_Buy: when one increases, the other increases too.*
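
The strength of a relationship like this can be quantified with the Pearson correlation coefficient, which `Series.corr` computes by default. The numbers below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented sample values, only to demonstrate the call
demo = pd.DataFrame({
    "Credit_Limit": [2000, 5000, 10000, 20000],
    "Avg_Open_To_Buy": [1500, 4200, 9000, 18500],
})

# Pearson correlation: values close to +1 indicate a strong positive linear relationship
r = demo["Credit_Limit"].corr(demo["Avg_Open_To_Buy"])
print(f"Pearson r = {r:.4f}")
```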

### 7) What insights can we derive about customer behavior from the box plot of total transactions?

In [None]:
import plotly.express as px  # quick, interactive visualizations
import pandas as pd

df = pd.DataFrame(Bank)

# Create a histogram with a marginal box plot above it
# nbins=20 asks Plotly to split the data range into at most about 20 equal-width bins

fig = px.histogram(df, x='Total_Trans_Amt', nbins=20, marginal='box', title='Distribution of Total Transaction Amount (Last 12 months)')

# Update layout
fig.update_layout(
    xaxis_title='Total Trans Amt',
    yaxis_title='Frequency',
    bargap=0.1,
    bargroupgap=0.1
)

# Show the plot
fig.show()


#### Insights:
*Total transaction amounts between 1100 and 1199 had the highest frequency.*

### 8) What is the correlation between the number of total transactions ('Total_Trans_Ct') and the total transaction amount ('Total_Trans_Amt')?

In [None]:
import plotly.express as px
import pandas as pd


df = pd.DataFrame(Bank)

fig = px.scatter(df, x="Total_Trans_Amt", y="Total_Trans_Ct", animation_frame="Customer_Age",
           size="Credit_Limit", color="Gender", hover_name="Marital_Status",
           log_x=True, size_max=55, range_x=[df["Total_Trans_Amt"].min(), df["Total_Trans_Amt"].max()],
           range_y=[df["Total_Trans_Ct"].min(), df["Total_Trans_Ct"].max()])

fig["layout"].pop("updatemenus")  # optional, drop animation buttons
fig.update_layout(title='Scatter Plot with Animation for Customer Age',
                  xaxis=dict(title='Total_Trans_Amt'),
                  yaxis=dict(title='Total_Trans_Ct'))

fig.show()


### 9) How does the distribution of 'Credit_Limit' vary across different levels of 'Avg_Open_To_Buy'?

In [None]:
import plotly.graph_objects as go  # lower-level module for building and customizing figures

fig = go.Figure()
fig.add_trace(go.Box(y=Bank['Credit_Limit'], name='Credit_Limit'))
fig.add_trace(go.Box(y=Bank['Avg_Open_To_Buy'], name='Avg_Open_To_Buy'))
fig.show()

### 10) How does the number of inactive months ('Months_Inactive_12_mon') vary over time?

In [None]:
import matplotlib.pyplot as plt

# The x-axis is the DataFrame row index, which stands in for time here
plt.figure(figsize=(12, 6))
plt.plot(Bank['Months_Inactive_12_mon'], marker='o', linestyle='-', color='c')
plt.title('Months Inactive 12 Months Over Time')
plt.xlabel('Index or Time Period')
plt.ylabel('Months Inactive 12 Months')
plt.grid(True)
plt.show()

#### insights:
The longest period of customer inactivity was 6 months, while the shortest was 0 months.

##### Get column names

In [None]:
column_names = Bank.columns

print(column_names)

### 11) What are the positive and negative correlations between the numerical columns?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Select numerical columns for correlation analysis
numerical_columns = ['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
                      'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
                      'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct',
                      'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']

# Subset the DataFrame with numerical columns
df_numerical = Bank[numerical_columns]

# Compute the correlation matrix using Pearson correlation coefficient
corr_matrix = df_numerical.corr()

# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, cmap='magma', annot=True, fmt=".2f", linewidths=.5)
plt.title("Correlation Matrix of Numerical Columns")
plt.show()


In [None]:
import pandas as pd

# Select numeric columns
numeric_columns = ['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
                   'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
                   'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
                   'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']

# Create a subset DataFrame with numeric columns
numeric_data = Bank[numeric_columns]

# Calculate correlation matrix
correlation_matrix = numeric_data.corr()

# Get columns with positive correlation
positive_correlation = correlation_matrix[correlation_matrix > 0].stack().sort_values(ascending=False)

# Display column pairs with positive correlation
# (values below 1 exclude each column's correlation with itself)
print("Columns with Positive Correlation:")
print(positive_correlation[positive_correlation < 1])

In [None]:
import pandas as pd

# Select numeric columns
numeric_columns = ['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
                   'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
                   'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
                   'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']

# Create a subset DataFrame with numeric columns
numeric_data = Bank[numeric_columns]

# Calculate correlation matrix
correlation_matrix = numeric_data.corr()

# Get columns with negative correlation
negative_correlation = correlation_matrix[correlation_matrix < 0].stack().sort_values(ascending=True)

# Display columns with negative correlation
print("Columns with Negative Correlation:")
print(negative_correlation[negative_correlation < -0.1])  # Adjust the threshold as needed
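
Because a correlation matrix is symmetric, stacking the whole matrix lists every pair twice, once as (A, B) and once as (B, A). One way to list each pair only once is to keep just the entries above the diagonal before stacking. This is a sketch on a toy DataFrame; `demo`, `corr`, and `pairs` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: b rises with a, c falls as a rises
demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})
corr = demo.corr()

# Boolean mask that is True only strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; drop them so each pair appears exactly once
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)
```

The same mask could be applied to `correlation_matrix` before filtering for positive or negative values.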

### 12) What is the relationship between Avg_Open_To_Buy and Avg_Utilization_Ratio?

In [None]:
import plotly.graph_objects as go


fig = go.Figure()

# Scatter plot
fig.add_trace(go.Scatter(
    x=Bank['Avg_Open_To_Buy'],
    y=Bank['Avg_Utilization_Ratio'],
    mode='markers',
    marker=dict(color='red')
))

# Update layout
fig.update_layout(
    title='Scatter Plot of Avg_Open_To_Buy vs Avg_Utilization_Ratio',
    xaxis=dict(title='Avg_Open_To_Buy'),
    yaxis=dict(title='Avg_Utilization_Ratio'),
)

fig.show()

#### From the above:
*There is a negative relationship between Avg_Open_To_Buy and Avg_Utilization_Ratio.*

### 13) How are customers distributed by age?

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Wrap Bank as a DataFrame
df = pd.DataFrame(Bank)

# Plot a histogram of customer age with a kernel density estimate overlaid
sns.histplot(data=df['Customer_Age'], kde=True)
plt.title('Histogram of Customer Age')
plt.xlabel('Customer_Age')
plt.ylabel('Count')
plt.show()

#### Insight :
*Customers aged 40 to 60 make up the majority of the bank's customers.*

### 14) What is the relationship between Total_Trans_Amt and Total_Trans_Ct?

In [None]:
Bank.plot(x= 'Total_Trans_Amt', y= 'Total_Trans_Ct' , kind='scatter', color = 'red');

#### From the above:
*There is a positive relationship between Total_Trans_Amt and Total_Trans_Ct.*

### 15) What is the distribution of card categories among customers with different education levels?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Education_Level', hue='Card_Category', data=df)
plt.show()

#### Insights:
*The Blue card is the most used card across all education levels, and the Gold card is used only at the high school and graduate levels.*

### 16) How does the distribution of customer age vary across different income categories?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Wrap Bank as a DataFrame; the plot needs 'Customer_Age' and 'Income_Category'
df = pd.DataFrame(Bank)

# Stacked histogram of 'Customer_Age', colored by income category
sns.histplot(data=df, x='Customer_Age', hue='Income_Category', multiple="stack", palette='viridis', bins=10)

# Add title and labels
plt.title('Histogram of Customer Age by Income Category')
plt.xlabel('Customer Age')
plt.ylabel('Count')

# Show the plot
plt.show()

#### Insight:
*All customers, regardless of age, fall within an income category below 40K.*

## Machine Learning Model

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier  # classification algorithm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt

# Load the dataset
bank2 = pd.read_csv(r'C:\Users\abdel\OneDrive\Desktop\bank(4).csv')

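# One-hot encode the 'Attrition_Flag' column and attach the indicator column to bank2
# (note: the model below is trained on the original string labels in 'Attrition_Flag')
att = pd.get_dummies(bank2['Attrition_Flag'], drop_first=True)
bank2 = bank2.merge(att, how='left', left_index=True, right_index=True)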

# Define features (X) and target variable (y)
X = bank2[['Total_Relationship_Count', 'Months_Inactive_12_mon',
           'Contacts_Count_12_mon', 'Total_Revolving_Bal', 'Dependent_count',
           'Total_Trans_Amt', 'Total_Trans_Ct', 'Avg_Utilization_Ratio', 'Total_Ct_Chng_Q4_Q1']]
y = bank2['Attrition_Flag']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the Random Forest classifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf.predict(X_test)

# Calculate the confusion matrix and derive accuracy from its diagonal
cm = confusion_matrix(y_test, y_pred)
accuracy = cm.trace() / cm.sum()
print(classification_report(y_test, y_pred))
print(f"The accuracy of the Random Forest is {accuracy * 100:.2f}%")

# Visualize the confusion matrix using Seaborn
sns.heatmap(cm, annot=True, fmt='g', cmap='Purples', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
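
Since only about 16.1% of customers attrited, overall accuracy can look high even when many attrited customers are missed, so it may help to read per-class recall from the confusion matrix as well. The sketch below uses an invented 2x2 matrix; the counts and the class order are assumptions, not results from the model above.

```python
import numpy as np

# Invented confusion matrix: rows = actual class, columns = predicted class
# Assumed class order: [attrited, existing]
cm_demo = np.array([[30, 20],
                    [10, 140]])

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()

# Recall per class: diagonal entry divided by the number of actual samples in that row
recall_per_class = np.diag(cm_demo) / cm_demo.sum(axis=1)

print(f"accuracy = {accuracy_demo:.2f}")
print(f"recall (attrited, existing) = {recall_per_class.round(2)}")
```

In this made-up case the accuracy is 0.85 while recall for the attrited class is only 0.60, which is the kind of gap the `classification_report` output is useful for spotting.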