This dataset represents a customer profiling scenario for a retail or e-commerce company. Each row is a customer.

Here's what each column means:

| Column | Real-World Meaning |
| --- | --- |
| ID | Unique identifier for each customer (customer IDs 1 to 200) |
| Age | Age of the customer (used to segment young, middle-aged, and senior customers) |
| Income | Estimated annual income of the customer (used to understand spending power) |
| Score | A score from 0 to 100 indicating customer engagement or loyalty, e.g., how actively they use the app, give feedback, or interact with offers |
| Purchases | Number of purchases made by the customer in the last 6 months or 1 year |
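
The Age column is described as a basis for segmenting customers. A minimal sketch of one way to do that with `pd.cut` follows; the toy ages, the bin edges (30 and 55), and the labels are assumptions made for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy ages; the bin edges and labels are illustrative assumptions.
ages = pd.Series([19, 25, 38, 46, 56, 67], name="Age")

# Intervals are right-closed by default: (0, 30], (30, 55], (55, 120]
segments = pd.cut(
    ages,
    bins=[0, 30, 55, 120],
    labels=["young", "middle-aged", "senior"],
)

print(pd.concat([ages, segments.rename("Segment")], axis=1))
```

Different businesses draw these boundaries differently, so the edges are best treated as a parameter to tune.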

In [19]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns

CLEANING DATA

In [23]:
x=pd.read_csv('numeric_data_cleaning.csv')

df=x.copy()
print(df.head(),'\n')      # preview the first 5 rows
print(df.info())      # column summary: Income and Score each have 1 missing value

   ID  Age    Income  Score  Purchases
0   1   56  37972.58  75.24          4
1   2   46  70760.24  79.16          4
2   3   32  71078.08  78.96          5
3   4   25  70884.89   9.12          4
4   5   38  36790.39  49.44          1 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         200 non-null    int64  
 1   Age        200 non-null    int64  
 2   Income     199 non-null    float64
 3   Score      199 non-null    float64
 4   Purchases  200 non-null    int64  
dtypes: float64(2), int64(3)
memory usage: 7.9 KB
None
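
The missing values that `info()` reveals through its non-null counts can also be counted directly. The sketch below uses a small toy frame (its values are made up for illustration) and shows `isna().sum()` for per-column counts plus a row filter to locate the gaps.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in Income and one in Score (illustrative values).
toy = pd.DataFrame({
    "Income": [37972.58, np.nan, 71078.08],
    "Score": [75.24, 79.16, np.nan],
    "Purchases": [4, 4, 5],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_gaps = toy[toy.isna().any(axis=1)]
print(rows_with_gaps)
```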


## Missing Data

In [None]:
# print(df[['Income','Score']].to_string())    # uncomment to inspect both columns
# assign the result back instead of calling fillna(inplace=True) on a column selection
df['Income']=df['Income'].fillna(df['Income'].mean().round(2))              # Income filled with the column mean

# before filling Score, check for outliers first
print(np.where((df['Score']>100) | (df['Score']<0)))                        # row 75 is an outlier
df.loc[75,'Score']=100                                                      # outlier replaced with 100
df['Score']=df['Score'].fillna(df['Score'].mean().round(2))                 # Score filled with the column mean
print(df.info())                                                            # confirm that no missing values remain


(array([], dtype=int64),)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         200 non-null    int64  
 1   Age        200 non-null    int64  
 2   Income     200 non-null    float64
 3   Score      200 non-null    float64
 4   Purchases  200 non-null    int64  
dtypes: float64(2), int64(3)
memory usage: 7.9 KB
None
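
A minimal sketch of the same two steps on a toy frame, written with plain assignment rather than `inplace=True` on a column selection. The values are made up, and `clip` is used here as one possible way to pull out-of-range scores back into 0 to 100; it is a substitute for the manual `df.loc[...]` fix, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy data: one missing Income, one out-of-range Score, one missing Score.
toy = pd.DataFrame({
    "Income": [40000.0, np.nan, 60000.0],
    "Score": [50.0, 130.0, np.nan],
})

# Assign the filled column back to the frame.
toy["Income"] = toy["Income"].fillna(round(toy["Income"].mean(), 2))

# Cap scores to the valid 0-100 range BEFORE computing the mean,
# so the outlier does not distort the fill value. NaN is left untouched by clip.
toy["Score"] = toy["Score"].clip(lower=0, upper=100)
toy["Score"] = toy["Score"].fillna(round(toy["Score"].mean(), 2))

print(toy)
```

Handling the outlier first matters: with 130 left in place, the fill value for Score would be 90 instead of 75.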



## WRONG FORMAT

In [38]:
print(df.info())        # no object-dtype columns, so nothing is stored in the wrong format

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         200 non-null    int64  
 1   Age        200 non-null    int64  
 2   Income     200 non-null    float64
 3   Score      200 non-null    float64
 4   Purchases  200 non-null    int64  
dtypes: float64(2), int64(3)
memory usage: 7.9 KB
None
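
This dataset has no wrongly formatted columns, so there is nothing to repair here. For contrast, the sketch below shows what the check and a possible fix could look like when a numeric column arrives as text. The toy values are invented, and stripping thousands separators before `pd.to_numeric(..., errors="coerce")` is only one assumed cleaning strategy.

```python
import pandas as pd

# Toy frame where Income was read as text (illustrative values).
toy = pd.DataFrame({
    "Age": [56, 46, 32],
    "Income": ["37972.58", "70,760.24", "unknown"],
})

# Columns that are not numeric are candidates for a wrong format.
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
print(non_numeric)

# Remove thousands separators, then convert; unparseable entries become NaN.
cleaned_income = pd.to_numeric(toy["Income"].str.replace(",", ""), errors="coerce")
print(cleaned_income)
```

Entries coerced to NaN can then be handled with the same missing-data steps used earlier.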


## WRONG DATA

In [53]:
print(df.head())

# check for negative values
print(np.where(df<0))    # row 100, column Age, holds a negative value


df.loc[100,'Age']=int(df['Age'].mean())     # replace it with the mean age; no negative values remain

#print(df.to_string())                      # observing the data suggests the average purchase count is about 10

print(np.where(df['Purchases']>20))         # checking row 150
print(df.loc[150,'Purchases'])      # 1000
df.loc[150,'Purchases']=10                  # replace the outlier with that value


   ID  Age    Income  Score  Purchases
0   1   56  37972.58  75.24          4
1   2   46  70760.24  79.16          4
2   3   32  71078.08  78.96          5
3   4   25  70884.89   9.12          4
4   5   38  36790.39  49.44          1
(array([], dtype=int64), array([], dtype=int64))
(array([150], dtype=int64),)
1000
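
The threshold of 20 above was picked by eye. As an alternative, the sketch below flags outliers with the common 1.5 × IQR rule and replaces them with the median of the remaining values. This is a swapped-in technique shown on made-up numbers, not what the cell above does.

```python
import pandas as pd

# Toy purchase counts with one obvious outlier (illustrative values).
purchases = pd.Series([4, 4, 5, 4, 2, 6, 3, 1000], name="Purchases")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = purchases.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences.
mask = (purchases < lower) | (purchases > upper)
flagged = purchases.index[mask].tolist()
print(flagged)

# Replace flagged values with the median of the non-flagged ones.
cleaned = purchases.copy()
cleaned[mask] = int(purchases[~mask].median())
print(cleaned.tolist())
```

The median is used instead of the mean because it is not pulled upward by the extreme value being replaced.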


## FIXING DATATYPES AND OTHERS
* scores must be int
* change the Income column name
* convert Income to int

In [74]:
print(df.to_string())
df['Score']=df['Score'].astype(int)                   # astype(int) truncates the decimals (78.96 -> 78)
df=df.rename(columns={'Income':'Income in Rs.'})      # column name changed
df['Income in Rs.']=df['Income in Rs.'].astype(int)   # truncated the same way


      ID  Age  Income in Rs.  Score  Purchases
0      1   56          37972     75          4
1      2   46          70760     79          4
2      3   32          71078     78          5
3      4   25          70884      9          4
4      5   38          36790     49          1
5      6   56          52692      5          4
6      7   36          42598     54          6
7      8   40          63847     44          3
8      9   28          75599     88          6
9     10   28          63103     35          2
10    11   41          50137     11          4
11    12   53          44516     14          7
12    13   57          59736     50          6
13    14   41          31656     61          5
14    15   20          58045     10          4
15    16   39          36279      8          3
16    17   19          59308     70          3
17    18   41          47585      7          7
18    19   47          44176     82          7
19    20   55          36717     70          6
20    21   19
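
Note that `astype(int)` drops the fractional part rather than rounding, which is why 78.96 appears as 78 above. If rounding to the nearest integer is preferred, a small sketch of the difference (using the first few scores from the data) could look like this:

```python
import pandas as pd

scores = pd.Series([75.24, 79.16, 78.96, 9.12], name="Score")

# Truncation: the fractional part is simply dropped.
truncated = scores.astype(int)

# Rounding first, then converting to int.
rounded = scores.round().astype(int)

print(pd.DataFrame({"original": scores, "truncated": truncated, "rounded": rounded}))
```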

In [75]:
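# cleaned file saved
df.to_csv('cleaned_file.csv', index=False)    # index=False keeps the row index out of the file; ID already identifies each row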
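
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below demonstrates this on a toy frame, using in-memory buffers in place of a real file.

```python
import io
import pandas as pd

toy = pd.DataFrame({"ID": [1, 2], "Age": [56, 46]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
reloaded_with = pd.read_csv(io.StringIO(with_index.getvalue()))

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
reloaded_without = pd.read_csv(io.StringIO(without_index.getvalue()))

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```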