## Ans1 a)

Identify Data Sources: Determine the different sources from which you want to collect data. This can include databases, APIs, message queues, log files, or streaming platforms.

Define Data Collection Strategy: Decide on the frequency and method of data collection for each source. This can involve scheduled batch processing, event-driven processing, or real-time streaming.

Extract Data: Develop connectors or APIs to extract data from each source. Use appropriate libraries, APIs, or query languages specific to the data source, such as SQL for databases or REST APIs for web services.

Transform and Normalize Data: Perform any necessary data transformation and normalization to ensure consistency and compatibility across different sources. This may involve data type conversions, merging or splitting data, or handling missing values.

Apply Data Validation and Cleansing: Implement validation and cleansing steps to ensure the integrity and quality of the collected data. This can include data validation rules, data type validation, duplicate removal, or outlier detection.
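The transformation, validation, and cleansing steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library; the field names (`id`, `amount`, `created_at`) and the helper functions are hypothetical and not taken from any particular source system.

```python
from datetime import datetime


def normalize(record):
    """Coerce a raw record (values may arrive as strings) into consistent types."""
    amount = record.get("amount")
    return {
        "id": str(record["id"]).strip(),
        # Treat a missing or empty amount as None instead of failing
        "amount": float(amount) if amount not in (None, "") else None,
        # Assumes ISO-8601 timestamps without a timezone suffix
        "created_at": datetime.fromisoformat(record["created_at"]),
    }


def is_valid(record):
    """Example validation rules: non-empty id and a non-negative amount."""
    return bool(record["id"]) and record["amount"] is not None and record["amount"] >= 0


def clean(raw_records):
    """Normalize, validate, and de-duplicate records by id."""
    seen, cleaned = set(), []
    for raw in raw_records:
        try:
            record = normalize(raw)
        except (KeyError, ValueError, TypeError):
            continue  # skip records that cannot be parsed
        if not is_valid(record) or record["id"] in seen:
            continue
        seen.add(record["id"])
        cleaned.append(record)
    return cleaned


raw = [
    {"id": " a1 ", "amount": "10.5", "created_at": "2024-01-05T10:00:00"},
    {"id": "a1", "amount": "10.5", "created_at": "2024-01-05T10:00:00"},  # duplicate
    {"id": "b2", "amount": "", "created_at": "2024-01-05T11:00:00"},      # missing amount
    {"id": "c3", "amount": "oops", "created_at": "2024-01-05T12:00:00"},  # bad type
]
cleaned = clean(raw)
print(cleaned)
```

In a real pipeline the rejected records would typically be logged or routed to a quarantine location rather than silently dropped.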

Store Data: Determine the storage mechanism based on your requirements. This can involve relational or NoSQL databases, data lakes, or cloud storage services like Amazon S3 or Google Cloud Storage.

Data Partitioning and Indexing: Implement strategies to partition and index the data based on your retrieval and analysis needs. Partitioning can be based on time, location, or any other relevant criteria, while indexing improves query performance.
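As one possible illustration of time-based partitioning, the sketch below groups events by a date-derived key. The `year=/month=/day=` layout follows a common Hive-style naming convention; the event structure and the `partition_key` helper are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def partition_key(ts):
    """Build a date-based partition path from a timestamp."""
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}"


events = [
    {"id": 1, "ts": datetime(2024, 1, 5, 10, 0)},
    {"id": 2, "ts": datetime(2024, 1, 5, 23, 59)},
    {"id": 3, "ts": datetime(2024, 1, 6, 0, 1)},
]

# Group event ids by the partition they would be written to
partitions = defaultdict(list)
for event in events:
    partitions[partition_key(event["ts"])].append(event["id"])

for key, ids in partitions.items():
    print(key, ids)
```

Queries that filter on a date range then only need to read the matching partitions.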

Error Handling and Retry Mechanism: Implement error handling mechanisms to capture and handle any exceptions that occur during data ingestion. Set up a retry mechanism to handle temporary failures or connectivity issues.
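A retry mechanism of this kind might look like the sketch below, which retries a callable with exponential backoff. The `with_retry` helper, its parameters, and the `flaky_fetch` source are hypothetical; the `sleep` argument is injectable only so the example runs without actually waiting.

```python
import time


def with_retry(func, max_attempts=3, base_delay=0.5,
               retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_fetch():
    """Simulated source that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return ["row-1", "row-2"]


delays = []
rows = with_retry(flaky_fetch, sleep=delays.append)  # record delays instead of sleeping
print(rows, delays)
```

Only errors listed in `retry_on` are retried; anything else (for example a malformed query) fails immediately, which is usually the desired behaviour for non-transient errors.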

Monitoring and Alerting: Implement monitoring tools and metrics to track the health and performance of the data ingestion pipeline. Set up alerts to notify administrators of any issues or anomalies, such as data source unavailability or data ingestion failures.

---------------------------------------------------------------------------------------------------------------------------------------
## b)

IoT Device Integration: Establish a connection with the IoT devices to collect sensor data. This may involve using protocols such as MQTT or CoAP and setting up appropriate device registries or management platforms.

Data Streaming: Set up a real-time data streaming framework to handle the continuous flow of sensor data. Popular streaming platforms like Apache Kafka or AWS Kinesis can be used for this purpose.

Data Serialization and Encoding: Choose an appropriate data serialization format such as Avro, Protocol Buffers, or JSON to encode the sensor data for streaming.
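To make the serialization step concrete, here is a small sketch that encodes a sensor reading as compact JSON bytes and decodes it again. JSON is used only because it needs no extra dependencies; Avro or Protocol Buffers would additionally require a schema definition and a third-party library. The `SensorReading` fields are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SensorReading:
    device_id: str
    temperature_c: float
    timestamp: int  # Unix epoch seconds


def encode(reading):
    """Serialize a reading to compact UTF-8 JSON bytes for the stream."""
    return json.dumps(asdict(reading), separators=(",", ":")).encode("utf-8")


def decode(payload):
    """Rebuild a reading from the bytes produced by encode()."""
    return SensorReading(**json.loads(payload.decode("utf-8")))


reading = SensorReading(device_id="sensor-7", temperature_c=21.5, timestamp=1700000000)
payload = encode(reading)
restored = decode(payload)
print(payload, restored)
```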

Real-time Processing: Implement real-time processing components to analyze and transform the incoming sensor data. This can involve performing aggregations, filtering, feature extraction, or applying machine learning models in real-time.
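One simple form of real-time aggregation is a rolling mean over the most recent readings. The sketch below is a plain-Python stand-in for what a stream processor would do per device; the `RollingMean` class and the sample values are illustrative assumptions.

```python
from collections import deque


class RollingMean:
    """Maintain the mean of the last `size` values seen."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # old values drop off automatically

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)


stream = [20.0, 22.0, 24.0, 50.0, 21.0]
roller = RollingMean(size=3)
means = [roller.update(value) for value in stream]
print(means)
```

The spike at `50.0` lifts the rolling mean for the next few readings, which is the kind of signal a downstream filter or alert rule could act on.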

Data Storage and Persistence: Store the processed sensor data in a suitable storage system. Depending on the use case, this can be a database, data warehouse, or data lake.

Data Visualization and Monitoring: Develop visualizations or dashboards to provide real-time insights and monitoring of the sensor data. This allows stakeholders to track the system's performance and make informed decisions based on the collected data.

----------------------------------------------------------------------------------------------------------------------------------------
## c)

File Format Detection: Implement a mechanism to detect the file format of incoming data files. This can be done based on file extensions or by inspecting the file content.
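A minimal sketch of format detection, assuming only CSV, JSON, and XML need to be recognised: it trusts a known file extension first and otherwise falls back to inspecting the first characters of the content. The `detect_format` helper and its heuristics are illustrative, not exhaustive.

```python
from pathlib import Path

# Known extensions mapped to format names (assumed set of supported formats)
EXTENSIONS = {".csv": "csv", ".json": "json", ".xml": "xml"}


def detect_format(filename, head=""):
    """Guess the file format from the extension, then from the content head."""
    ext = Path(filename).suffix.lower()
    if ext in EXTENSIONS:
        return EXTENSIONS[ext]
    text = head.lstrip()
    if text.startswith(("{", "[")):
        return "json"
    if text.startswith("<"):
        return "xml"
    first_line = text.splitlines()[0] if text else ""
    if "," in first_line:
        return "csv"
    return "unknown"


print(detect_format("data.CSV"))
print(detect_format("export", '{"a": 1}'))
print(detect_format("export", "id,name\n1,x"))
print(detect_format("blob", ""))
```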

Data Parsing: Develop parsers or libraries to parse and extract data from different file formats. Use appropriate libraries or APIs for each format, such as CSV parsers or JSON deserializers.

Data Validation: Apply data validation rules to ensure the integrity and quality of the ingested data. This can include checking for data type conformity, enforcing data constraints, or performing format-specific validation.

Data Cleansing and Transformation: Implement data cleansing and transformation steps to handle inconsistencies or errors in the ingested data. This can involve data cleaning techniques like removing duplicates, handling missing values, or standardizing data formats.

Schema Mapping: Define a schema or mapping mechanism to map the incoming data to a standardized format or schema. This helps ensure consistency across different data sources and simplifies downstream processing.
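The parsing and schema-mapping steps can be sketched together as below, using the standard `csv` and `json` modules. The source field names in `FIELD_MAP` and the target schema (`customer_id`, `amount`) are hypothetical.

```python
import csv
import io
import json

# Hypothetical mapping from source-specific field names to the standard schema
FIELD_MAP = {
    "cust_id": "customer_id",
    "customerId": "customer_id",
    "amt": "amount",
    "total": "amount",
}


def parse(text, fmt):
    """Parse raw text into a list of dict rows based on the detected format."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {fmt}")


def to_standard(row):
    """Rename fields and coerce types to the standard schema."""
    mapped = {FIELD_MAP.get(key, key): value for key, value in row.items()}
    return {"customer_id": str(mapped["customer_id"]), "amount": float(mapped["amount"])}


csv_text = "cust_id,amt\n42,19.99\n"
json_text = '[{"customerId": 42, "total": 5}]'

rows = [to_standard(r) for r in parse(csv_text, "csv")]
rows += [to_standard(r) for r in parse(json_text, "json")]
print(rows)
```

Both inputs end up with identical keys and types, so downstream validation and cleansing can be written once against the standard schema.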

## Ans3 a)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [3]:
from sklearn.datasets import fetch_california_housing
dataset=fetch_california_housing()

In [7]:
df=pd.DataFrame(dataset.data,columns=dataset.feature_names)
df['Price']=dataset.target

In [9]:
X=df.iloc[:,0:8]
y=df['Price']

In [13]:
# Cross-validation
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Hold out 33% of the data as a test set
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=67)

# Fit the scaler on the training split only, then apply it to the test split
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

# Fit a baseline model on the full training split
regressor=LinearRegression()
regressor.fit(X_train,y_train)

# 5-fold cross-validation on the training split (default regressor score is R^2)
score=cross_val_score(LinearRegression(),X=X_train,y=y_train,cv=5)
print(np.mean(score))



0.6000490693530499


## Ans3 b)

In [19]:
from sklearn.datasets import make_classification
independent_features,target=make_classification(n_samples=500,n_features=3,n_informative=1,n_classes=2,n_clusters_per_class=1,random_state=10)

In [21]:
clf_df=pd.DataFrame(independent_features,columns=['f1','f2','f3'])

In [22]:
clf_df['target']=target

In [27]:
X=clf_df.drop('target',axis=1)
y=clf_df.target
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=67)

In [28]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score

# Fit a k-nearest-neighbours classifier with default settings (n_neighbors=5)
clf=KNeighborsClassifier()
clf.fit(X_train,y_train)

# Evaluate on the held-out test split
y_pred=clf.predict(X_test)
accuracy=accuracy_score(y_test,y_pred)
f1score=f1_score(y_test,y_pred)
precision=precision_score(y_test,y_pred)
recall=recall_score(y_test,y_pred)
print('accuracy',accuracy)
print('f1score',f1score)
print('precision',precision)
print('recall',recall)



accuracy 1.0
f1score 1.0
precision 1.0
recall 1.0


## Ans3 c)
Dataset Analysis: Inspect the class distribution to quantify the imbalance, for example the ratio of majority to minority samples. Identify the minority class (usually treated as the positive class) and the majority class (usually the negative class).

Stratified Sampling Approach: Use stratified sampling to create the training and validation sets. Stratification preserves the original class proportions in both sets, so the minority class is not under-represented or missing from either split. It does not rebalance the classes; that would require a separate technique such as resampling or class weights.

Determine Sampling Ratio: Decide on the train/validation split ratio, for example 80/20. With a severe imbalance, check that the validation set still contains enough minority-class samples to give stable metrics.

Random Sampling: Randomly sample instances from each class while maintaining the predetermined ratio. This ensures that the selected samples are representative of the original class distribution.
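The stratified split described above can be sketched in plain Python as follows. The `stratified_split` helper and the 90/10 toy labels are assumptions for illustration; in practice scikit-learn's `train_test_split(..., stratify=y)` or `StratifiedKFold` provide the same behaviour.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so each class keeps the same proportion in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random sampling within each class
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced dataset: 90 negatives, 10 positives
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both splits keep the original 9:1 class ratio, and the `max(1, ...)` guard ensures every class appears at least once in the validation set.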

## Ans4 a)
Infrastructure Selection: Choose a suitable infrastructure for deploying the model. This can include cloud platforms, on-premises servers, or serverless architectures, depending on your requirements and scalability needs.

Real-time Data Ingestion: Set up a mechanism to capture and ingest user interactions or events in real-time. This can involve integrating with APIs, message queues, or streaming platforms to receive user data as it happens.

Model Deployment: Deploy the trained machine learning model to the chosen infrastructure. This can be done by wrapping the model in a web framework such as Flask or Django, or by using a serverless computing platform such as AWS Lambda or Azure Functions.
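As a rough sketch of the serverless option, the function below follows the `handler(event, context)` shape used by AWS Lambda behind an HTTP proxy integration, where the request body arrives as a JSON string and the response carries a `statusCode` and `body`. `DummyModel` is a placeholder assumption; a real deployment would load a serialized trained model once at startup.

```python
import json


class DummyModel:
    """Placeholder standing in for a trained model (assumption for the example)."""

    def predict(self, features):
        return [sum(row) for row in features]


# Loaded once per process so repeated invocations reuse it
MODEL = DummyModel()


def handler(event, context=None):
    """Parse the request body, run the model, and return a JSON response."""
    try:
        body = json.loads(event.get("body") or "{}")
        features = body["features"]
    except (KeyError, TypeError, json.JSONDecodeError):
        error = {"error": "expected a JSON body with a 'features' list"}
        return {"statusCode": 400, "body": json.dumps(error)}
    predictions = MODEL.predict(features)
    return {"statusCode": 200, "body": json.dumps({"predictions": predictions})}


response = handler({"body": json.dumps({"features": [[1, 2, 3], [4, 5, 6]]})})
bad_response = handler({"body": "not json"})
print(response, bad_response)
```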

----------------------------------------------------------------------------------------------------------------------------------------
## b)
Version Control: Use a version control system like Git to manage the machine learning model's code and associated configuration files. Maintain separate branches for development, testing, and production environments.

Continuous Integration and Deployment (CI/CD): Set up a CI/CD pipeline to automate the deployment process. Configure triggers that initiate the pipeline whenever new code changes are pushed to the repository or based on predefined schedules.

Build and Packaging: Create a build script that packages the model code, dependencies, and any necessary configurations into deployable artifacts. Use Docker to produce a container image, or a serverless toolchain (e.g., AWS SAM or the Azure Functions Core Tools) to produce a serverless deployment package.

----------------------------------------------------------------------------------------------------------------------------------------
## c)
Performance Monitoring: Set up monitoring systems to track the deployed model's performance metrics, including response times, resource utilization, throughput, and error rates. Use tools like Prometheus, Grafana, or cloud provider monitoring services to collect and visualize these metrics.

Log Collection and Analysis: Collect logs from the deployed model and associated infrastructure components. Use log aggregation tools like ELK Stack (Elasticsearch, Logstash, Kibana) or cloud-native logging services to centralize and analyze logs for troubleshooting and performance optimization.

Anomaly Detection: Apply anomaly detection techniques to identify unusual patterns or behaviors in the model's performance. Use statistical methods, machine learning algorithms, or anomaly detection platforms to automatically detect and alert on abnormal system behavior.
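A simple statistical approach to the anomaly detection step is a z-score check on a performance metric. The sketch below flags response times that lie more than a chosen number of standard deviations from the mean; the `find_anomalies` helper, the threshold of 3.0, and the latency values are illustrative assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical response times in milliseconds with one obvious spike
latencies_ms = [102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 100, 480]
anomalies = find_anomalies(latencies_ms)
print(anomalies)
```

Because a large outlier inflates both the mean and the standard deviation, production systems often compute the baseline from a trailing window of known-good data or use robust statistics such as the median instead.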