# **Example of how structured data might be represented and processed**

- It creates a DataFrame with customer data.
- It separates the features and the target variable.
- It splits the data into training and testing sets.
- It trains a logistic regression model on the training data.
- Finally, it evaluates the model's accuracy on the test data and prints the result.

In [1]:
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
# Example of structured data
data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'age': [28, 35, 42, 50, 33],
    'tenure': [12, 24, 36, 48, 6],
    'monthly_charge': [50.0, 70.0, 100.0, 80.0, 65.0],
    'churn': [0, 0, 1, 1, 0]
})

# Select the feature columns (X) and the target column (y)
X = data[['age', 'tenure', 'monthly_charge']]
y = data['churn']

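# Split into training and test sets, then fit a logistic regression model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model on the held-out test set.
# Note: with five rows and test_size=0.2 the test set holds a single row,
# so the score can only be 0.0 or 1.0.
print("Model accuracy:", model.score(X_test, y_test))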

Model accuracy: 1.0
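
The perfect score above comes from a very small test set. The following is a minimal sketch, in plain Python, of the arithmetic behind it. The `accuracy` helper is hypothetical and only mirrors what a mean-accuracy score computes; it is not scikit-learn's implementation. The rounding-up of the test-set size is stated as an assumption about how a fractional `test_size` is applied.

```python
import math


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


n_samples = 5      # rows in the example DataFrame
test_size = 0.2    # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test
print("train rows:", n_train, "test rows:", n_test)

# With a single test row, only two scores are possible.
print(accuracy([0], [0]))  # prediction correct
print(accuracy([0], [1]))  # prediction wrong
```

A score of 1.0 here therefore says little about how well the model generalizes; a larger dataset would be needed for a meaningful estimate.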


# **Processing unstructured text data**

- It downloads the necessary NLTK resources for tokenization and stopwords.

- It defines a function to preprocess text by converting it to lowercase, tokenizing it, removing stopwords, and keeping only alphabetic tokens.

- It demonstrates this function on an example customer review, outputting the processed tokens.

In [None]:
# Install nltk
!pip install nltk

In [4]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [5]:
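# Download the necessary NLTK resources.
# Note: newer NLTK releases look for 'punkt_tab' instead of 'punkt'; if
# word_tokenize raises a LookupError, also run nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')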

def preprocess_text(text):
    # Lowercase the text, then split it into word tokens
    tokens = word_tokenize(text.lower())

    # Remove stopwords and non-alphabetic tokens
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]

    return tokens

# Example unstructured data
customer_review = """
The product was great! It arrived on time and the quality exceeded my expectations.
However, the customer service could use some improvement. Overall, I’m satisfied.
"""

# Preprocess the example text
processed_tokens = preprocess_text(customer_review)
print("Processed Tokens:", processed_tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Processed Tokens: ['product', 'great', 'arrived', 'time', 'quality', 'exceeded', 'expectations', 'however', 'customer', 'service', 'could', 'use', 'improvement', 'overall', 'satisfied']
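
To make the filtering step easier to follow, here is a minimal sketch of it in plain Python, without NLTK. The stopword set is a small hand-picked subset and the token list is written out by hand; both are assumptions for illustration, not actual NLTK data or tokenizer output.

```python
# Hypothetical stopword subset for illustration only;
# the real list comes from nltk.corpus.stopwords.
STOP_WORDS = {"the", "was", "it", "on", "and", "my"}


def filter_tokens(tokens):
    """Keep alphabetic tokens that are not in the stopword set."""
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]


# Lowercased tokens as a tokenizer might emit them (assumed input).
tokens = ["the", "product", "was", "great", "!", "it", "arrived", "on", "time", "."]
filtered = filter_tokens(tokens)
print(filtered)
```

Punctuation such as "!" and "." is dropped because `str.isalpha()` returns False for it, and common words are dropped by the stopword lookup.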


# **Working with semi-structured data in JSON format**

- The code snippet parses JSON data, converts it to a pandas DataFrame, and then reads a nested value, demonstrating how semi-structured data can be handled.

- It uses pandas' json_normalize function to flatten the parsed JSON records into a DataFrame, which is more suitable for analysis. The printed DataFrame has one column per attribute; nested specifications appear as dotted columns such as specs.processor, and a product that lacks a given spec gets NaN in that column.

- The script then prints the laptop's processor type ("Intel i7"). It does this by filtering the DataFrame for the row where the product name is "Laptop" and selecting the flattened specs.processor column.

In [6]:
# Import the necessary libraries
import json
import pandas as pd

# Example of semi-structured data in JSON format
json_data = '''
[
  {
    "product_id": "P001",
    "name": "Smartphone",
    "price": 599.99,
    "specs": {
      "screen": "6.2 inch",
      "battery": "4000 mAh",
      "camera": "12 MP"
    }
  },
  {
    "product_id": "P002",
    "name": "Laptop",
    "price": 1299.99,
    "specs": {
      "processor": "Intel i7",
      "ram": "16 GB",
      "storage": "512 GB SSD"
    }
  }
]
'''

# Parse JSON data
data = json.loads(json_data)

# Convert to pandas DataFrame
df = pd.json_normalize(data)

print(df)

# Accessing nested data
print("Processor of the laptop:")
print(df.loc[df['name'] == 'Laptop', 'specs.processor'].values[0])

  product_id        name    price specs.screen specs.battery specs.camera  \
0       P001  Smartphone   599.99     6.2 inch      4000 mAh        12 MP   
1       P002      Laptop  1299.99          NaN           NaN          NaN   

  specs.processor specs.ram specs.storage  
0             NaN       NaN           NaN  
1        Intel i7     16 GB    512 GB SSD  
Processor of the laptop:
Intel i7
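
The dotted column names and the NaN entries in the output above both come from flattening records whose nested keys differ. The sketch below reproduces that idea in plain Python. The `flatten` helper is hypothetical and is not how pandas implements json_normalize; missing values are shown as None where pandas would show NaN.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Trimmed-down versions of the two product records.
records = [
    {"name": "Smartphone", "specs": {"screen": "6.2 inch"}},
    {"name": "Laptop", "specs": {"processor": "Intel i7"}},
]

rows = [flatten(r) for r in records]

# The union of keys across all rows, in first-seen order, acts as the columns.
columns = list(dict.fromkeys(key for row in rows for key in row))

# A row that lacks a key gets None in that column.
table = [[row.get(col) for col in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

Each product only fills the columns for the specs it actually has, which is why the smartphone row is empty under specs.processor and the laptop row is empty under specs.screen.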
