Importing pandas

In [14]:
import pandas as pd


Load a CSV file into a pandas DataFrame


In [15]:
df = pd.read_csv('/content/01.Data Cleaning and Preprocessing.csv')
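
Headers in CSV exports sometimes carry stray leading or trailing spaces, and an exact-name check such as `'SteamFlow-4' in df.columns` then fails without raising an error. The sketch below shows one way to normalize headers right after loading. It uses a small in-memory CSV as a stand-in for the real file; `sample_csv` and `sample_df` are hypothetical names introduced for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV; two headers end with a trailing space.
sample_csv = io.StringIO(
    "Observation,ChipRate,SteamFlow-4 ,T-upperExt-2 \n"
    "31-00:00,16.520,67.122,358.282\n"
    "31-01:00,16.810,60.012,351.050\n"
)
sample_df = pd.read_csv(sample_csv)

# Before cleaning, the exact name is not found because of the trailing space.
found_before = "SteamFlow-4" in sample_df.columns

# Strip surrounding whitespace from every column name.
sample_df.columns = sample_df.columns.str.strip()
found_after = "SteamFlow-4" in sample_df.columns

print(found_before, found_after)
print(sample_df.columns.tolist())
```

Applying the same `str.strip()` step to the real DataFrame would change the column names used by every later cell, so it is best done once, immediately after `read_csv`.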


Display the first few rows of the DataFrame

In [16]:
print("First few rows of the dataset:")
print(df.head())

First few rows of the dataset:
  Observation  Y-Kappa  ChipRate  BF-CMratio  BlowFlow  ChipLevel4   \
0    31-00:00    23.10    16.520     121.717  1177.607      169.805   
1    31-01:00    27.60    16.810      79.022  1328.360      341.327   
2    31-02:00    23.19    16.709      79.562  1329.407      239.161   
3    31-03:00    23.60    16.478      81.011  1334.877      213.527   
4    31-04:00    22.90    15.618      93.244  1334.168      243.131   

   T-upperExt-2   T-lowerExt-2    UCZAA  WhiteFlow-4   ...  SteamFlow-4   \
0        358.282         329.545  1.443       599.253  ...        67.122   
1        351.050         329.067  1.549       537.201  ...        60.012   
2        350.022         329.260  1.600       549.611  ...        61.304   
3        350.938         331.142  1.604       623.362  ...        68.496   
4        351.640         332.709    NaN       638.672  ...        70.022   

   Lower-HeatT-3  Upper-HeatT-3   ChipMass-4   WeakLiquorF   BlackFlow-2   \
0       

Identify numeric columns


In [17]:
numeric_columns = df.select_dtypes(include=['number']).columns.tolist()
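
To see what `select_dtypes(include=['number'])` keeps and drops, here is a small illustration on a toy frame. The frame `toy` and its `Count` column are made up for this sketch; only the call itself mirrors the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Observation": ["31-00:00", "31-01:00"],  # strings: not numeric
    "Y-Kappa": [23.10, 27.60],                # floats: numeric
    "Count": [1, 2],                          # integers: numeric
})

# 'number' matches both integer and floating-point columns.
toy_numeric = toy.select_dtypes(include=["number"]).columns.tolist()
print(toy_numeric)
```

The text column is excluded, which is why `Observation` never appears in `numeric_columns` and is left untouched by the mean-filling step.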


Handle missing values only in numeric columns by filling them with the mean


In [18]:
df_filled = df.copy()
for col in numeric_columns:
    df_filled[col] = df_filled[col].fillna(df_filled[col].mean())
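
The loop above can also be written without an explicit loop, and it is often useful to count the missing values first so the effect of the fill is visible. A minimal sketch on a toy frame (the data and the names `toy`, `missing_before`, and `toy_filled` are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Observation": ["a", "b", "c"],
    "UCZAA": [1.0, np.nan, 3.0],
})

# Number of missing values per column before filling.
missing_before = toy.isna().sum()

# mean(numeric_only=True) returns one mean per numeric column;
# passing that Series to fillna fills each column with its own mean.
toy_filled = toy.fillna(toy.mean(numeric_only=True))

print(missing_before)
print(toy_filled)
```

Non-numeric columns are absent from the Series of means, so they are left unchanged, matching the behavior of the loop over `numeric_columns`.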


In [19]:
print("\nData with missing values handled (filled with mean) for numeric columns:")
print(df_filled)



Data with missing values handled (filled with mean) for numeric columns:
    Observation  Y-Kappa  ChipRate  BF-CMratio  BlowFlow  ChipLevel4   \
0      31-00:00    23.10    16.520     121.717  1177.607      169.805   
1      31-01:00    27.60    16.810      79.022  1328.360      341.327   
2      31-02:00    23.19    16.709      79.562  1329.407      239.161   
3      31-03:00    23.60    16.478      81.011  1334.877      213.527   
4      31-04:00    22.90    15.618      93.244  1334.168      243.131   
..          ...      ...       ...         ...       ...          ...   
319    10-16:00    23.75    12.667      93.450  1178.252      276.955   
320     9-19:00    19.80    12.558      94.352  1184.119      297.071   
321     9-20:00    23.01    12.550      90.842  1188.517      289.826   
322     9-21:00    24.32    13.083      88.910  1192.879      318.006   
323     9-22:00    25.75    13.417      85.451  1186.342      248.312   

     T-upperExt-2   T-lowerExt-2      UCZAA  Whit

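Filter data based on a condition (example: rows where 'ChipRate' > 50). No row meets this threshold, because ChipRate peaks at 16.958 in this dataset, so the result is an empty DataFrame.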

In [21]:
if 'ChipRate' in numeric_columns:
    filtered_df = df_filled[df_filled['ChipRate'] > 50]
    print("\nFiltered data (ChipRate > 50):")
    print(filtered_df)



Filtered data (ChipRate > 50):
Empty DataFrame
Columns: [Observation, Y-Kappa, ChipRate, BF-CMratio, BlowFlow, ChipLevel4 , T-upperExt-2 , T-lowerExt-2  , UCZAA, WhiteFlow-4 , AAWhiteSt-4 , AA-Wood-4  , ChipMoisture-4 , SteamFlow-4 , Lower-HeatT-3, Upper-HeatT-3 , ChipMass-4 , WeakLiquorF , BlackFlow-2 , WeakWashF , SteamHeatF-3 , T-Top-Chips-4 , SulphidityL-4 ]
Index: []

[0 rows x 23 columns]
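
A threshold of 50 returns no rows because it lies above every ChipRate value (the summary statistics further down report a maximum of 16.958). One way to pick a threshold that actually splits the data is to derive it from the column itself, for example from a quantile. The sketch below uses the five ChipRate values shown in the `head()` output; the choice of the median as the cut-off is an assumption made for illustration.

```python
import pandas as pd

# ChipRate values taken from the first five rows printed earlier.
toy = pd.DataFrame({"ChipRate": [16.520, 16.810, 16.709, 16.478, 15.618]})

# Inspect the range before choosing a threshold.
print(toy["ChipRate"].min(), toy["ChipRate"].max())

# A fixed threshold above the maximum selects nothing.
rows_above_50 = int((toy["ChipRate"] > 50).sum())

# A data-driven threshold: keep rows strictly above the median.
threshold = toy["ChipRate"].quantile(0.5)
above = toy[toy["ChipRate"] > threshold]

print(rows_above_50, threshold, len(above))
```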


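Filter on the 'SteamFlow-4' column
Ensure the column is numeric before filtering

In [22]:
# The CSV headers carry trailing spaces (e.g. 'SteamFlow-4 '), so an exact-name
# check never matches; look the column up by its stripped name instead.
steam_col = next((c for c in numeric_columns if c.strip() == 'SteamFlow-4'), None)
if steam_col is not None:
    filtered_df_steamflow = df_filled[df_filled[steam_col] > 50]  # Adjust the threshold as needed
    print("\nFiltered data (SteamFlow-4 > 50):")
    print(filtered_df_steamflow)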

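Create a new column 'T-totalExt' as the sum of 'T-upperExt-2' and 'T-lowerExt-2'

In [23]:
# Match on stripped names because the CSV headers carry trailing spaces.
by_name = {c.strip(): c for c in numeric_columns}
if 'T-upperExt-2' in by_name and 'T-lowerExt-2' in by_name:
    df_filled['T-totalExt'] = df_filled[by_name['T-upperExt-2']] + df_filled[by_name['T-lowerExt-2']]
    print("\nData with new column 'T-totalExt':")
    print(df_filled.head())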

Calculate summary statistics for numeric columns


In [24]:
summary_statistics = df_filled.describe()
print("\nSummary statistics of the dataset for numeric columns:")
print(summary_statistics)


Summary statistics of the dataset for numeric columns:
          Y-Kappa    ChipRate  BF-CMratio     BlowFlow  ChipLevel4   \
count  324.000000  324.000000  324.000000   324.000000   324.000000   
mean    20.635370   14.347937   87.464456  1237.837614   258.164483   
std      3.070036    1.487447    7.781774    98.070606    87.851143   
min     12.170000    9.983000   68.645000     0.000000     0.000000   
25%     18.382500   13.364750   82.156750  1194.525750   213.527000   
50%     20.845000   14.347937   87.253500  1254.658500   271.605500   
75%     23.032500   15.498250   92.123250  1288.628750   321.285000   
max     27.600000   16.958000  121.717000  1351.240000   419.014000   

       T-upperExt-2   T-lowerExt-2         UCZAA  WhiteFlow-4   AAWhiteSt-4   \
count     324.000000       324.00000  324.000000     324.00000    324.000000   
mean      356.904295       324.02018    1.492010     591.73226      6.140410   
std         9.180734         7.59777    0.101741      66.91253  