<a href="https://colab.research.google.com/github/Nithi7518/Lost-and-Found/blob/main/getting_started_tutorials/rapids-pip-colab-template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a target="_blank" href="https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/rapids-pip-colab-template.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Install RAPIDS into Colab"/>
</a>

# RAPIDS cuDF is now already on your Colab instance!
RAPIDS cuDF is preinstalled on Google Colab and instantly accelerates Pandas with zero code changes. [You can quickly get started with our tutorial notebook](https://nvda.ws/rapids-cudf). This notebook template is for users who want to utilize the full suite of the RAPIDS libraries for their workflows on Colab.  

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

You can check the output of `!nvidia-smi` to check which GPU you have.  Please uncomment the cell below if you'd like to do that.  Currently, RAPIDS runs on all available Colab GPU instances.

In [None]:
# !nvidia-smi

#Setup:
This set up script:

1. Checks to make sure that the GPU is RAPIDS compatible
1. Pip Installs the RAPIDS' libraries, which are:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuxFilter
  1. cuCIM
  1. xgboost

# Controlling Which RAPIDS Version is Installed
This line in the cell below, `!python rapidsai-csp-utils/colab/pip-install.py`, kicks off the RAPIDS installation script.  You can control the RAPIDS version installed by adding either `latest`, `nightlies` or the default/blank option.  Example:

`!python rapidsai-csp-utils/colab/pip-install.py <option>`

You can now tell the script to install:
1. **RAPIDS + Colab Default Version**, by leaving the install script option blank (or giving an invalid option), adds the rest of the RAPIDS libraries to the RAPIDS cuDF library preinstalled on Colab.  **This is the default and recommended version.**  Example: `!python rapidsai-csp-utils/colab/pip-install.py`
1. **Latest known working RAPIDS stable version**, by using the option `latest` upgrades all RAPIDS labraries to the latest working RAPIDS stable version.  Usually early access for future RAPIDS+Colab functionality - some functionality may not work, but can be same as the default version. Example: `!python rapidsai-csp-utils/colab/pip-install.py latest`
1. **the current nightlies version**, by using the option, `nightlies`, installs current RAPIDS nightlies version.  For RAPIDS Developer use - **not recommended/untested**.  Example: `!python rapidsai-csp-utils/colab/pip-install.py nightlies`


**This will complete in about 5-6 minutes**

In [None]:
# This get the RAPIDS-Colab install files and test check your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 490, done.[K
remote: Counting objects: 100% (221/221), done.[K
remote: Compressing objects: 100% (130/130), done.[K
remote: Total 490 (delta 149), reused 124 (delta 91), pack-reused 269[K
Receiving objects: 100% (490/490), 136.70 KiB | 6.21 MiB/s, done.
Resolving deltas: 100% (251/251), done.
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 714.5 kB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0
Installing the rest of the RAPIDS 24.4.* libraries
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cuml-cu12==24.4.*
  Downloading https://pypi.nvidia.com/cuml-cu12/cuml_cu12-24.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1200.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 GB 1.1 MB/s eta 0:00:00
Collecting cugraph-cu12==24.4.*
  Downloading

# RAPIDS is now installed on Colab.  
You can copy your code into the cells below or use the below to validate your RAPIDS installation and version.  
# Enjoy!

In [1]:
import cudf
cudf.__version__

'25.02.01'

In [4]:
import cupy
cupy.__version__

'13.3.0'

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib

In [19]:
# CELL 1: Setup & Imports (RAPIDS Version)
import cudf
import cupy as cp
import re

from cuml.model_selection import train_test_split
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.linear_model import LogisticRegression
from cuml.metrics import accuracy_score
from sklearn.metrics import classification_report
from cuml.pipeline import Pipeline
import time

print("Libraries (cuDF, cuML, CuPy) imported successfully.")

Libraries (cuDF, cuML, CuPy) imported successfully.


In [20]:
# CELL 3: Configuration (Mostly unchanged)

DATASET_PATH = '/content/Hotel_Reviews.csv'

NEGATIVE_REVIEW_COL = 'Negative_Review'
POSITIVE_REVIEW_COL = 'Positive_Review'
SCORE_COLUMN = 'Reviewer_Score'

SENTIMENT_THRESHOLD = 5.0

POSITIVE_LABEL = 'positive'
NEGATIVE_LABEL = 'negative'

sentiment_map = {NEGATIVE_LABEL: 0, POSITIVE_LABEL: 1}

print("Configuration set:")
print(f"  Dataset Path: {DATASET_PATH}")
print(f"  Negative Review Column: {NEGATIVE_REVIEW_COL}")
print(f"  Positive Review Column: {POSITIVE_REVIEW_COL}")
print(f"  Score Column (for sentiment): {SCORE_COLUMN}")
print(f"  Sentiment Threshold (Score > {SENTIMENT_THRESHOLD} is Positive): {SENTIMENT_THRESHOLD}")
print(f"  Sentiment Map: {sentiment_map}")

Configuration set:
  Dataset Path: /content/Hotel_Reviews.csv
  Negative Review Column: Negative_Review
  Positive Review Column: Positive_Review
  Score Column (for sentiment): Reviewer_Score
  Sentiment Threshold (Score > 5.0 is Positive): 5.0
  Sentiment Map: {'negative': 0, 'positive': 1}


In [21]:
# CELL 4: Load Data (Using cuDF)

print(f"Loading dataset from: {DATASET_PATH} using cuDF...")
columns_to_load = [NEGATIVE_REVIEW_COL, POSITIVE_REVIEW_COL, SCORE_COLUMN]

try:
    df = cudf.read_csv(DATASET_PATH, usecols=columns_to_load)
    print(f"Dataset loaded successfully using cuDF. Shape: {df.shape}")
    print("First 5 rows of loaded data:")
    print(df.head())
except FileNotFoundError:
    print(f"Error: Dataset file not found at '{DATASET_PATH}'. Check path and upload status.")
    raise
except ValueError as e:
     print(f"Error: Problem loading columns: {e}. ")
     print(f"Ensure CSV has columns: {columns_to_load} and check config in Cell 3.")
     raise
except Exception as e:
    print(f"An unexpected error occurred during loading: {e}")
    raise

Loading dataset from: /content/Hotel_Reviews.csv using cuDF...
Dataset loaded successfully using cuDF. Shape: (515738, 3)
First 5 rows of loaded data:
                                     Negative_Review  \
0   I am so angry that i made this post available...   
1                                        No Negative   
2   Rooms are nice but for elderly a bit difficul...   
3   My room was dirty and I was afraid to walk ba...   
4   You When I booked with your company on line y...   

                                     Positive_Review  Reviewer_Score  
0   Only the park outside of the hotel was beauti...             2.9  
1   No real complaints the hotel was great great ...             7.5  
2   Location was good and staff were ok It is cut...             7.1  
3   Great location in nice surroundings the bar a...             3.8  
4    Amazing location and building Romantic setting              6.7  


In [12]:
# CELL 5: Preprocessing Data (Using cuDF - Text Cleaning CORRECTED)

print("Preprocessing data using cuDF...")
start_time = time.time()

# --- Basic Checks and Cleaning ---
# Drop rows where the score is missing
df = df.dropna(subset=[SCORE_COLUMN]) # Use cuDF's dropna
# Ensure review columns are strings before cleaning
df[NEGATIVE_REVIEW_COL] = df[NEGATIVE_REVIEW_COL].astype(str)
df[POSITIVE_REVIEW_COL] = df[POSITIVE_REVIEW_COL].astype(str)
initial_rows = df.shape[0]
print(f"Initial rows after dropping NA scores: {initial_rows}")

# --- Text Cleaning Function (CORRECTED for GPU Compatibility) ---
def clean_text_gpu(text_series):
    # 1. Convert to lowercase first (handles potential NAs safely)
    text_series = text_series.fillna('').str.lower()

    # 2. Handle standard "No Negative"/"No Positive" (now case-sensitive on lowercase text)
    # Use regex=False for faster fixed string replacement.
    text_series = text_series.str.replace('no negative', '', regex=False)
    text_series = text_series.str.replace('no positive', '', regex=False)

    # 3. Remove punctuation and numbers using REGEX replace
    # Keep only letters and spaces. Use .str.replace with regex=True.
    text_series = text_series.str.replace(r'[^a-z\s]', '', regex=True) # CORRECTED METHOD

    # 4. Remove extra whitespace using REGEX replace, then strip
    text_series = text_series.str.replace(r'\s+', ' ', regex=True) # CORRECTED METHOD
    text_series = text_series.str.strip()

    # Optional: Fill any remaining NaNs potentially introduced
    text_series = text_series.fillna('')
    return text_series

# --- Combine and Clean Text ---
print("Cleaning and combining negative and positive review text...")
# Apply revised cleaning function to both review columns
cleaned_neg = clean_text_gpu(df[NEGATIVE_REVIEW_COL])
cleaned_pos = clean_text_gpu(df[POSITIVE_REVIEW_COL])

# Combine cleaned positive and negative reviews
# Add a space in between only if both parts have content
df['cleaned_text'] = cleaned_neg + ' ' + cleaned_pos
df['cleaned_text'] = df['cleaned_text'].str.strip() # Remove leading/trailing spaces from combine
print("Text combination and cleaning done.")

# --- Create Sentiment Labels from Score ---
print(f"Creating sentiment labels based on '{SCORE_COLUMN}' > {SENTIMENT_THRESHOLD}...")
# Ensure score column is numeric first
df[SCORE_COLUMN] = df[SCORE_COLUMN].astype('float32')
# Apply threshold (cuDF is efficient with numeric/boolean ops)
df['sentiment_numeric'] = (df[SCORE_COLUMN] > SENTIMENT_THRESHOLD).astype('int32')
print("Sentiment label creation done.")

# --- Final Cleanup ---
# Remove rows where the combined text is empty after cleaning
df = df[df['cleaned_text'].str.len() > 0]
rows_after_empty_text_drop = df.shape[0]
print(f"Removed {initial_rows - rows_after_empty_text_drop} rows with empty/NA text or scores during preprocessing.") # Adjusted count message

processed_rows = df.shape[0]
print(f"\nPreprocessing finished. Rows remaining for analysis: {processed_rows}")
print(f"Time taken for preprocessing: {time.time() - start_time:.2f} seconds")

# --- Checks and Final Output ---
if processed_rows < 50:
    print(f"\nWarning: Only {processed_rows} rows remaining after preprocessing.")
    if processed_rows == 0:
      print("Error: No data remaining.")
      raise SystemExit("Stopping due to no data.")

print(f"\nValue counts for derived sentiment (0={NEGATIVE_LABEL}, 1={POSITIVE_LABEL} based on score > {SENTIMENT_THRESHOLD}):")
try:
    # value_counts returns a cuDF Series
    print(df['sentiment_numeric'].value_counts())
except Exception as e:
    print(f"Could not get value counts: {e}")

print("\nSample of combined cleaned text, original score, and derived sentiment:")
try:
    print(df[['cleaned_text', SCORE_COLUMN, 'sentiment_numeric']].head())
except Exception as e:
    print(f"Could not display head(): {e}")

# Optional: Clean up intermediate GPU memory
# del cleaned_neg, cleaned_pos
# import gc; gc.collect()

Preprocessing data using cuDF...
Initial rows after dropping NA scores: 101465
Cleaning and combining negative and positive review text...
Text combination and cleaning done.
Creating sentiment labels based on 'Reviewer_Score' > 5.0...
Sentiment label creation done.
Removed 44 rows with empty/NA text or scores during preprocessing.

Preprocessing finished. Rows remaining for analysis: 101421
Time taken for preprocessing: 0.60 seconds

Value counts for derived sentiment (0=negative, 1=positive based on score > 5.0):
sentiment_numeric
1    95107
0     6314
Name: count, dtype: int64

Sample of combined cleaned text, original score, and derived sentiment:
                                        cleaned_text  Reviewer_Score  \
0  i am so angry that i made this post available ...             2.9   
1  no real complaints the hotel was great great l...             7.5   
2  rooms are nice but for elderly a bit difficult...             7.1   
3  my room was dirty and i was afraid to walk bar...

In [14]:
# CELL 6: Split Data into Training and Testing Sets (Revised - Index Splitting)
import cudf
import cupy as cp
from cuml.model_selection import train_test_split # Ensure this import is present

print("Splitting data into training and testing sets using cuML (via index splitting)...")

# Make sure df exists and is not empty (check assumes df is in locals scope from Cell 5)
if 'df' not in locals() or df.empty:
    print("Error: Cannot split data - DataFrame 'df' is not available or empty.")
    # Depending on your flow, you might want to raise an error here
    # raise ValueError("DataFrame 'df' is missing or empty before splitting.")
else:
    # Define features (X) and target (y) as cuDF Series
    X = df['cleaned_text']
    y = df['sentiment_numeric']

    # 1. Create a CuPy array of numerical indices for the DataFrame
    num_rows = df.shape[0]
    indices = cp.arange(num_rows) # CuPy array [0, 1, 2, ..., n-1]

    # 2. Convert the target Series 'y' to a CuPy array for stratification
    # Stratify function usually expects numerical arrays
    try:
        y_cupy = y.to_cupy()
        print(f"Target 'y' converted to CuPy array for stratification. Shape: {y_cupy.shape}")
    except Exception as e:
        print(f"Error converting target 'y' to CuPy array: {e}")
        print("Ensure 'y' (sentiment_numeric) is a numeric type in the DataFrame.")
        raise

    print(f"Splitting {len(indices)} indices based on target labels...")
    try:
        # 3. Split the NUMERICAL INDICES, stratifying by the CuPy target array
        train_idx_cp, test_idx_cp = train_test_split(
            indices,              # Split the indices array
            test_size=0.25,       # Use 25% of data for testing
            random_state=42,      # For reproducibility
            stratify=y_cupy,      # Stratify using the CuPy version of y
            shuffle=True          # Shuffle indices before splitting
        )
        print(f"Indices split successfully. Train indices: {len(train_idx_cp)}, Test indices: {len(test_idx_cp)}")

        # 4. Use the split indices to select data from the original cuDF Series
        # cuDF's .iloc usually accepts CuPy arrays or cuDF integer Series/Indices directly
        X_train = X.iloc[train_idx_cp]
        X_test = X.iloc[test_idx_cp]
        y_train = y.iloc[train_idx_cp]
        y_test = y.iloc[test_idx_cp]

        print(f"\nTraining set size: {len(X_train)}")
        print(f"Testing set size: {len(X_test)}")
        # Optional: Check types to confirm they are cuDF Series
        # print(f"X_train type: {type(X_train)}")
        # print(f"y_train type: {type(y_train)}")

    except ValueError as ve:
         # Catch potential errors even during index splitting
         print(f"\nValueError during index splitting or data selection: {ve}")
         raise
    except Exception as e:
        print(f"\nAn unexpected error occurred during splitting: {e}")
        raise

Splitting data into training and testing sets using cuML (via index splitting)...
Target 'y' converted to CuPy array for stratification. Shape: (101421,)
Splitting 101421 indices based on target labels...
Indices split successfully. Train indices: 76066, Test indices: 25355

Training set size: 76066
Testing set size: 25355


In [15]:
# CELL 7: Create TF-IDF and Model Pipeline (Using cuML)
print("Setting up cuML TF-IDF Vectorizer and Logistic Regression model...")

# cuML's TfidfVectorizer (API is very similar to sklearn's)
tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    max_features=20000,
    stop_words='english'
    # cuML's TF-IDF might default to different norms or parameters, check docs if needed
)

# cuML's Logistic Regression
# Check cuML documentation for exact parameter equivalence and availability (e.g., 'class_weight')
# As of recent versions, class_weight='balanced' IS often supported.
logistic_regression_model = LogisticRegression(
    # cuML's LogReg might have different default solvers or parameters.
    # It often uses highly optimized solvers like OWL-QN.
    penalty='l2', # Common default
    C=1.0,        # Regularization strength
    random_state=42,
    class_weight='balanced', # TRY THIS - Check if supported in your cuML version
    max_iter=500 # Adjust max_iter if needed
)

# Use cuML's Pipeline
model_pipeline = Pipeline([
    ('tfidf', tfidf_vectorizer),
    ('clf', logistic_regression_model)
])

print("cuML Pipeline created successfully.")
# print(model_pipeline) # Optional: print pipeline structure

Setting up cuML TF-IDF Vectorizer and Logistic Regression model...
[2025-04-14 16:46:36.568] [CUML] [info] Unused keyword parameter: random_state during cuML estimator initialization
cuML Pipeline created successfully.


In [16]:
# CELL 8: Train the Model (Using cuML Pipeline)
print("Training the cuML model pipeline...")
start_time = time.time()

# Check if training data exists (should be cuDF Series)
if 'X_train' in locals() and not X_train.empty:
    # .fit() trains the whole cuML pipeline on GPU data
    model_pipeline.fit(X_train, y_train)
    print(f"Model training completed.")
    print(f"Time taken for training: {time.time() - start_time:.2f} seconds")
else:
    print("Error: Training data (X_train cuDF Series) is not available or empty.")

Training the cuML model pipeline...
Model training completed.
Time taken for training: 6.92 seconds


In [17]:
# CELL 9: Evaluate the Model (Using cuML Metrics / Sklearn Report)
print("Evaluating the model on the test set...")
start_time = time.time()

# Check if model and test data exist
if 'model_pipeline' in locals() and 'X_test' in locals() and not X_test.empty:
    try:
      # Predict using the cuML pipeline (output is typically CuPy array or cuDF Series)
      y_pred = model_pipeline.predict(X_test)

      # Ensure y_test and y_pred are compatible for metrics (e.g., cupy arrays)
      # Convert if necessary, though cuML metrics often handle cuDF/CuPy directly
      # y_test_cp = cp.asarray(y_test.values) # Example conversion if needed
      # y_pred_cp = cp.asarray(y_pred)       # Example conversion if needed

      # Use cuML's accuracy score
      accuracy = accuracy_score(y_test, y_pred) # Pass cuDF Series directly

      print(f"Evaluation completed.")
      print(f"Time taken for evaluation: {time.time() - start_time:.2f} seconds")
      print("\n--- Evaluation Results ---")
      print(f"Accuracy (cuML): {accuracy:.4f}")

      # For the detailed classification report, pull data to CPU and use sklearn's
      print("\nCalculating Classification Report (requires data transfer to CPU)...")
      y_test_cpu = y_test.get() # Get as numpy array on CPU
      y_pred_cpu = y_pred.get() # Get as numpy array on CPU (if cupy) or y_pred.to_numpy() if cuDF Series

      report = classification_report(y_test_cpu, y_pred_cpu, target_names=[NEGATIVE_LABEL, POSITIVE_LABEL])
      print("\nClassification Report (sklearn on CPU data):")
      print(report)
      print("------------------------")

    except Exception as e:
        print(f"An error occurred during evaluation: {e}")
        # Consider adding more specific error handling for cuML/CuPy issues
        # import traceback
        # traceback.print_exc()
else:
    print("Error: Model or test data is not available. Cannot evaluate.")

Evaluating the model on the test set...
Evaluation completed.
Time taken for evaluation: 0.53 seconds

--- Evaluation Results ---
Accuracy (cuML): 0.8715

Calculating Classification Report (requires data transfer to CPU)...
An error occurred during evaluation: 'Series' object has no attribute 'get'


In [18]:
# CELL 10: Example Prediction on New Data (Using cuML Pipeline)
print("\n--- Example Prediction ---")

# New reviews as a standard Python list
new_reviews_list = [
    "The room was the worst experience",
    "Absolutely fantastic stay, loved the service and the clean room!",
    "It was okay, nothing special but not bad either."
]

# Check if model pipeline exists
if 'model_pipeline' in locals() and 'sentiment_map' in locals():
    try:
        # 1. Convert Python list to cuDF Series for GPU processing
        new_reviews_cudf = cudf.Series(new_reviews_list)

        # 2. Use the trained cuML pipeline to predict
        # Output is typically a CuPy array
        new_predictions_numeric_gpu = model_pipeline.predict(new_reviews_cudf)
        new_predictions_proba_gpu = model_pipeline.predict_proba(new_reviews_cudf)

        # 3. Transfer predictions back to CPU (NumPy arrays) for printing/mapping
        new_predictions_numeric = cp.asnumpy(new_predictions_numeric_gpu)
        new_predictions_proba = cp.asnumpy(new_predictions_proba_gpu)

        # 4. Map numeric predictions back to labels (same logic as before)
        label_map_rev = {v: k for k, v in sentiment_map.items()} # Reverse map {0: 'negative', 1: 'positive'}

        print("Prediction Results:")
        for i, review in enumerate(new_reviews_list): # Iterate through original list
            pred_label = label_map_rev[new_predictions_numeric[i]]
            neg_prob = new_predictions_proba[i][0] # Probability of class 0 (negative)
            pos_prob = new_predictions_proba[i][1] # Probability of class 1 (positive)
            print(f"\nReview: \"{review}\"")
            print(f"Predicted Sentiment: {pred_label}")
            print(f"Probabilities -> {NEGATIVE_LABEL}: {neg_prob:.4f}, {POSITIVE_LABEL}: {pos_prob:.4f}")
        print("------------------------")

    except Exception as e:
        print(f"An error occurred during example prediction: {e}")
        # import traceback
        # traceback.print_exc()
else:
    print("Error: Model pipeline or sentiment map not available for prediction.")


--- Example Prediction ---
Prediction Results:

Review: "The room was the worst experience"
Predicted Sentiment: negative
Probabilities -> negative: 0.9630, positive: 0.0370

Review: "Absolutely fantastic stay, loved the service and the clean room!"
Predicted Sentiment: positive
Probabilities -> negative: 0.0303, positive: 0.9697

Review: "It was okay, nothing special but not bad either."
Predicted Sentiment: negative
Probabilities -> negative: 0.8825, positive: 0.1175
------------------------
