In [None]:
Global Power Plant Database
Project Description
The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one’s own analysis. The database covers approximately 35,000 power plants from 167 countries and includes thermal plants (e.g. coal, gas, oil, nuclear, biomass, waste, geothermal) and renewables (e.g. hydro, wind, solar). Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. It will be continuously updated as data becomes available.
Key attributes of the database
The database includes the following indicators:
•	`country` (text): 3 character country code corresponding to the ISO 3166-1 alpha-3 specification [5]
•	`country_long` (text): longer form of the country designation
•	`name` (text): name or title of the power plant, generally in Romanized form
•	`gppd_idnr` (text): 10 or 12 character identifier for the power plant
•	`capacity_mw` (number): electrical generating capacity in megawatts
•	`latitude` (number): geolocation in decimal degrees; WGS84 (EPSG:4326)
•	`longitude` (number): geolocation in decimal degrees; WGS84 (EPSG:4326)
•	`primary_fuel` (text): energy source used in primary electricity generation or export
•	`other_fuel1` (text): energy source used in electricity generation or export
•	`other_fuel2` (text): energy source used in electricity generation or export
•	`other_fuel3` (text): energy source used in electricity generation or export
•	`commissioning_year` (number): year of plant operation, weighted by unit-capacity when data is available
•	`owner` (text): majority shareholder of the power plant, generally in Romanized form
•	`source` (text): entity reporting the data; could be an organization, report, or document, generally in Romanized form
•	`url` (text): web document corresponding to the `source` field
•	`geolocation_source` (text): attribution for geolocation information
•	`wepp_id` (text): a reference to a unique plant identifier in the widely-used PLATTS-WEPP database.
•	`year_of_capacity_data` (number): year the capacity information was reported
•	`generation_gwh_2013` (number): electricity generation in gigawatt-hours reported for the year 2013
•	`generation_gwh_2014` (number): electricity generation in gigawatt-hours reported for the year 2014
•	`generation_gwh_2015` (number): electricity generation in gigawatt-hours reported for the year 2015
•	`generation_gwh_2016` (number): electricity generation in gigawatt-hours reported for the year 2016
•	`generation_gwh_2017` (number): electricity generation in gigawatt-hours reported for the year 2017
•	`generation_gwh_2018` (number): electricity generation in gigawatt-hours reported for the year 2018
•	`generation_gwh_2019` (number): electricity generation in gigawatt-hours reported for the year 2019
•	`generation_data_source` (text): attribution for the reported generation information
•	`estimated_generation_gwh_2013` (number): estimated electricity generation in gigawatt-hours for the year 2013
•	`estimated_generation_gwh_2014` (number): estimated electricity generation in gigawatt-hours for the year 2014 
•	`estimated_generation_gwh_2015` (number): estimated electricity generation in gigawatt-hours for the year 2015 
•	`estimated_generation_gwh_2016` (number): estimated electricity generation in gigawatt-hours for the year 2016 
•	`estimated_generation_gwh_2017` (number): estimated electricity generation in gigawatt-hours for the year 2017 
•	`estimated_generation_note_2013` (text): label of the model/method used to estimate generation for the year 2013
•	`estimated_generation_note_2014` (text): label of the model/method used to estimate generation for the year 2014 
•	`estimated_generation_note_2015` (text): label of the model/method used to estimate generation for the year 2015
•	`estimated_generation_note_2016` (text): label of the model/method used to estimate generation for the year 2016
•	`estimated_generation_note_2017` (text): label of the model/method used to estimate generation for the year 2017 
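The yearly `generation_gwh_*` columns listed above are stored in wide form, one column per year. As a minimal sketch, assuming a pandas DataFrame that uses those column names (the two sample rows, their identifiers, and their values are invented for illustration), they can be reshaped into one row per plant and year:

```python
import pandas as pd

# Hypothetical sample rows; identifiers and values are invented for illustration.
sample = pd.DataFrame({
    "gppd_idnr": ["PLANT_A", "PLANT_B"],
    "capacity_mw": [100.0, 250.0],
    "generation_gwh_2017": [310.5, None],
    "generation_gwh_2018": [298.0, 1020.3],
})

# Collect the reported-generation columns (this prefix excludes the estimated_* ones).
year_columns = [c for c in sample.columns if c.startswith("generation_gwh_")]

# Wide -> long: one row per (plant, year).
long_form = sample.melt(
    id_vars=["gppd_idnr", "capacity_mw"],
    value_vars=year_columns,
    var_name="year",
    value_name="generation_gwh",
)

# Keep only the four-digit year from names such as "generation_gwh_2017".
long_form["year"] = long_form["year"].str[-4:].astype(int)

# Rows without a reported value carry no information in long form.
long_form = long_form.dropna(subset=["generation_gwh"]).reset_index(drop=True)
print(long_form)
```

The long form makes it easier to group by year or to plot generation over time, at the cost of repeating the plant-level attributes on every row.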
Fuel Type Aggregation
We define the "Fuel Type" attribute of our database based on common fuel categories. 
Prediction: Make two predictions: 1) `primary_fuel` 2) `capacity_mw`

Dataset Link-
•	https://github.com/wri/global-power-plant-database/blob/master/source_databases_csv/database_IND.csv
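Predicting `primary_fuel` is a classification task, while `capacity_mw` is a regression task. The sketch below illustrates the classification side only. It is not trained on the real CSV: the data is randomly generated, the three fuel labels and their typical capacities are assumptions made for the example, and the choice of `RandomForestClassifier` is one reasonable option rather than a prescribed method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; every value is randomly generated.
rng = np.random.default_rng(0)
n = 200
fuel = rng.choice(["Coal", "Hydro", "Solar"], size=n)

# Assumed typical capacities, chosen only so the classes are learnable.
typical_mw = {"Coal": 800.0, "Hydro": 150.0, "Solar": 20.0}
plants = pd.DataFrame({
    "capacity_mw": [typical_mw[f] * rng.uniform(0.5, 1.5) for f in fuel],
    "latitude": rng.uniform(8.0, 35.0, size=n),
    "longitude": rng.uniform(68.0, 97.0, size=n),
    "primary_fuel": fuel,
})

X = plants[["capacity_mw", "latitude", "longitude"]]
y = plants["primary_fuel"]

# Stratify so each fuel type appears in both the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy on synthetic data:", accuracy)
```

On the real dataset the fuel classes are likely to be imbalanced, so a per-class report (for example `sklearn.metrics.classification_report`) would be more informative than accuracy alone.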


In [None]:
import pandas as pd

#Primary fuel

# Local copy of database_IND.csv; the raw string keeps the Windows backslashes literal.
url = r"D:\Intern projects\database_IND.csv"
df = pd.read_csv(url)
print(df.head())
unique_fuels = df['primary_fuel'].unique()
print("Unique Fuel Types:", unique_fuels)


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
import numpy as np

#capacity_mw

feature_columns = ['commissioning_year', 'generation_gwh_2013', 'generation_gwh_2014', 'generation_gwh_2015', 'generation_gwh_2016', 'generation_gwh_2017', 'generation_gwh_2018', 'generation_gwh_2019']
X = df[feature_columns]
y = df['capacity_mw']

# Split first so the imputer learns column means from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

# Example plant with illustrative values, passed through the same imputer as the training data.
new_data_point = pd.DataFrame([[2020, 1000, 1200, 1300, 1100, 1500, 1400, 1600]], columns=feature_columns)
new_data_point = imputer.transform(new_data_point)
predicted_capacity = model.predict(new_data_point)
print("Predicted Capacity (MW):", predicted_capacity[0])


In [None]:
Loan Application Status Prediction
Project Description
This dataset includes details of applicants who have applied for a loan, such as credit history, loan amount, income, and dependents.
Independent Variables:
1.	Loan_ID - This refers to the unique identifier of the applicant's affirmed purchases
2.	Gender - This refers to either of the two main categories (male and female) into which applicants are divided on the basis of their reproductive functions
3.	Married - This refers to the applicant being in a state of matrimony
4.	Dependents - This refers to persons who depend on the applicant for survival
5.	Education - This refers to the number of years in which the applicant received systematic instruction, especially at a school or university
6.	Self_Employed - This refers to the applicant working for oneself as a freelancer or the owner of a business rather than for an employer
7.	Applicant Income - This refers to disposable income available for the applicant's use under State law.
8.	CoapplicantIncome - This refers to disposable income available, under State law, for the use of the people who participate in the loan application process alongside the main applicant.
9.	Loan_Amount - This refers to the amount of money an applicant owes at any given time.
10.	Loan_Amount_Term - This refers to the duration for which the loan is made available to the applicant
11.	Credit History - This refers to a record of the applicant's ability to repay debts and demonstrated responsibility in repaying them.
12.	Property_Area - This refers to the total area within the boundaries of the property as set out in Schedule.
13.	Loan_Status - This refers to whether the applicant is eligible for the loan requested.
You have to build a model that can predict whether the applicant's loan will be approved (Loan_Status) or not on the basis of the details provided in the dataset.
Dataset Link-  https://github.com/dsrscientist/DSData/blob/master/loan_prediction.csv
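
The modelling task described above can be sketched as a preprocessing-plus-classifier pipeline. This is only an illustration under stated assumptions: the applicants are randomly generated, the approval rule is a toy rule tied to credit history, and the column spellings `ApplicantIncome`, `LoanAmount`, and `Credit_History` are assumed rather than taken from the description (check `df.columns` on the real file). Logistic regression is used as a simple baseline, not as the required model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic applicants; every value here is invented for illustration.
rng = np.random.default_rng(0)
n = 300
credit = rng.choice([0.0, 1.0], size=n, p=[0.2, 0.8])
applicants = pd.DataFrame({
    "ApplicantIncome": rng.normal(5000, 1500, size=n).round(),
    "LoanAmount": rng.normal(150, 40, size=n).round(),
    "Credit_History": credit,
    "Property_Area": rng.choice(["Urban", "Semiurban", "Rural"], size=n),
})
# Toy rule: approval follows credit history, so the model has a signal to find.
applicants["Loan_Status"] = np.where(credit == 1.0, "Y", "N")
# Blank out a few cells to mimic missing entries.
applicants.loc[::25, "LoanAmount"] = np.nan

numeric = ["ApplicantIncome", "LoanAmount", "Credit_History"]
categorical = ["Property_Area"]

# Impute and scale numeric columns; impute and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression(max_iter=1000)),
])

X = applicants.drop(columns="Loan_Status")
y = applicants["Loan_Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy on synthetic data:", accuracy)
```

Keeping imputation and encoding inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into the model.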


In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split

url = "https://github.com/dsrscientist/DSData/raw/master/loan_prediction.csv"
df = pd.read_csv(url)
print(df.info())
print(df.describe())
print(df.head())

# Fill numeric gaps with the column mean and categorical gaps with the most frequent value.
numeric_columns = df.select_dtypes(include='number').columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())
categorical_columns = df.select_dtypes(exclude='number').columns
df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])

# Drop rows where any numeric value lies 3 or more standard deviations from its column mean (either side).
z_scores = np.abs(stats.zscore(df[numeric_columns]))
df = df[(z_scores < 3).all(axis=1)]

# Loan_ID is an identifier, not a predictor; one-hot encode the categorical features.
df = df.drop(columns=['Loan_ID'])
df = pd.get_dummies(df, columns=['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area'], drop_first=True)

X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
This data is for the purpose of bias correction of next-day maximum and minimum air temperatures forecast of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea. This data consists of summer data from 2013 to 2017. The input data is largely composed of the LDAPS model's next-day forecast data, in-situ maximum and minimum temperatures of present-day, and geographic auxiliary variables. There are two outputs (i.e. next-day maximum and minimum air temperatures) in this data. Hindcast validation was conducted for the period from 2015 to 2017.

Attribute Information:
For more information, read [Cho et al, 2020].
1. station - used weather station number: 1 to 25
2. Date - Present day: yyyy-mm-dd ('2013-06-30' to '2017-08-30')
3. Present_Tmax - Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6
4. Present_Tmin - Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9
5. LDAPS_RHmin - LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5
6. LDAPS_RHmax - LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100
7. LDAPS_Tmax_lapse - LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5
8. LDAPS_Tmin_lapse - LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6
9. LDAPS_WS - LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9
10. LDAPS_LH - LDAPS model forecast of next-day average latent heat flux (W/m2): -13.6 to 213.4
11. LDAPS_CC1 - LDAPS model forecast of next-day 1st 6-hour split average cloud cover (0-5 h) (%): 0 to 0.97
12. LDAPS_CC2 - LDAPS model forecast of next-day 2nd 6-hour split average cloud cover (6-11 h) (%): 0 to 0.97
13. LDAPS_CC3 - LDAPS model forecast of next-day 3rd 6-hour split average cloud cover (12-17 h) (%): 0 to 0.98
14. LDAPS_CC4 - LDAPS model forecast of next-day 4th 6-hour split average cloud cover (18-23 h) (%): 0 to 0.97
15. LDAPS_PPT1 - LDAPS model forecast of next-day 1st 6-hour split average precipitation (0-5 h) (%): 0 to 23.7
16. LDAPS_PPT2 - LDAPS model forecast of next-day 2nd 6-hour split average precipitation (6-11 h) (%): 0 to 21.6
17. LDAPS_PPT3 - LDAPS model forecast of next-day 3rd 6-hour split average precipitation (12-17 h) (%): 0 to 15.8
18. LDAPS_PPT4 - LDAPS model forecast of next-day 4th 6-hour split average precipitation (18-23 h) (%): 0 to 16.7
19. lat - Latitude (°): 37.456 to 37.645
20. lon - Longitude (°): 126.826 to 127.135
21. DEM - Elevation (m): 12.4 to 212.3
22. Slope - Slope (°): 0.1 to 5.2
23. Solar radiation - Daily incoming solar radiation (wh/m2): 4329.5 to 5992.9
24. Next_Tmax - The next-day maximum air temperature (°C): 17.4 to 38.9
25. Next_Tmin - The next-day minimum air temperature (°C): 11.3 to 29.8

You have to build separate models that can predict the minimum temperature for the next day and the maximum temperature for the next day based on the details provided in the dataset.

Dataset Link-

https://github.com/dsrscientist/Dataset2/blob/main/temperature.csv
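
Bias correction, as described above, means learning a mapping from the model forecast to the observed temperature. The sketch below shows the idea on synthetic numbers only: the "observed" and "forecast" series are randomly generated, the 1.5 °C warm bias is an assumption made for the example, and a single-feature linear regression stands in for whichever model is eventually chosen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 500

# Synthetic observations and a forecast with an assumed constant warm bias plus noise.
observed_tmax = rng.uniform(20.0, 36.0, size=n)
forecast_tmax = observed_tmax + 1.5 + rng.normal(0.0, 0.8, size=n)

# Ordered split: fit on the first 80% of days, check on the last 20%.
split = int(n * 0.8)
X_train = forecast_tmax[:split].reshape(-1, 1)
X_test = forecast_tmax[split:].reshape(-1, 1)
y_train = observed_tmax[:split]
y_test = observed_tmax[split:]

# Baseline: use the raw forecast as the prediction.
raw_mae = mean_absolute_error(y_test, X_test.ravel())

# Corrected: regress the observed value on the forecast.
corrector = LinearRegression().fit(X_train, y_train)
corrected_mae = mean_absolute_error(y_test, corrector.predict(X_test))

print("Raw forecast MAE:", raw_mae)
print("Bias-corrected MAE:", corrected_mae)
```

Comparing against the raw forecast in this way gives a reference point: a correction model is only useful if its error is lower than that of the uncorrected forecast column.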


In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/dsrscientist/Dataset2/main/temperature.csv"
df = pd.read_csv(url)
print("Dataset Shape:", df.shape)
print("\nColumns in the Dataset:", df.columns)
print("\nData Types of Columns:\n", df.dtypes)
print("\nSummary Statistics:\n", df.describe())
print("\nFirst 5 Rows of the Dataset:\n", df.head())
from sklearn.model_selection import train_test_split
print("Missing Values:\n", df.isnull().sum())
df['Date'] = pd.to_datetime(df['Date'])
features = df.drop(['Next_Tmax', 'Next_Tmin'], axis=1)
target_tmax = df['Next_Tmax']
target_tmin = df['Next_Tmin']
features_train, features_test, target_tmax_train, target_tmax_test, target_tmin_train, target_tmin_test = train_test_split(
    features, target_tmax, target_tmin, test_size=0.2, random_state=42
)
# Reuse the DataFrame loaded above. Drop rows with missing values, since
# LinearRegression and GradientBoostingRegressor do not accept NaN inputs.
df = df.dropna()
train_start_date = '2013-01-01'
train_end_date = '2016-12-31'
test_start_date = '2017-01-01'
test_end_date = '2017-12-31'
train_data = df[(df['Date'] >= train_start_date) & (df['Date'] <= train_end_date)]
test_data = df[(df['Date'] >= test_start_date) & (df['Date'] <= test_end_date)]
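# Date is a datetime column that the scalers and regressors cannot consume, so it is dropped along with the targets.
features_train = train_data.drop(['Date', 'Next_Tmax', 'Next_Tmin'], axis=1)
target_tmax_train = train_data['Next_Tmax']
target_tmin_train = train_data['Next_Tmin']
features_test = test_data.drop(['Date', 'Next_Tmax', 'Next_Tmin'], axis=1)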
target_tmax_test = test_data['Next_Tmax']
target_tmin_test = test_data['Next_Tmin']
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_train_scaled = scaler.fit_transform(features_train)
features_test_scaled = scaler.transform(features_test)
from sklearn.preprocessing import MinMaxScaler
# Use a separate name so the fitted StandardScaler above is not overwritten.
minmax_scaler = MinMaxScaler()
features_train_normalized = minmax_scaler.fit_transform(features_train)
features_test_normalized = minmax_scaler.transform(features_test)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
linear_reg_tmax = LinearRegression()
linear_reg_tmax.fit(features_train_scaled, target_tmax_train)
linear_reg_tmin = LinearRegression()
linear_reg_tmin.fit(features_train_scaled, target_tmin_train)
tree_reg_tmax = DecisionTreeRegressor(random_state=42)
tree_reg_tmax.fit(features_train_scaled, target_tmax_train)
tree_reg_tmin = DecisionTreeRegressor(random_state=42)
tree_reg_tmin.fit(features_train_scaled, target_tmin_train)
rf_reg_tmax = RandomForestRegressor(random_state=42)
rf_reg_tmax.fit(features_train_scaled, target_tmax_train)
rf_reg_tmin = RandomForestRegressor(random_state=42)
rf_reg_tmin.fit(features_train_scaled, target_tmin_train)
gb_reg_tmax = GradientBoostingRegressor(random_state=42)
gb_reg_tmax.fit(features_train_scaled, target_tmax_train)
gb_reg_tmin = GradientBoostingRegressor(random_state=42)
gb_reg_tmin.fit(features_train_scaled, target_tmin_train)
def evaluate_model(model, features, target):
    predictions = model.predict(features)
    mae = mean_absolute_error(target, predictions)
    return mae
linear_reg_tmax_mae = evaluate_model(linear_reg_tmax, features_test_scaled, target_tmax_test)
linear_reg_tmin_mae = evaluate_model(linear_reg_tmin, features_test_scaled, target_tmin_test)
tree_reg_tmax_mae = evaluate_model(tree_reg_tmax, features_test_scaled, target_tmax_test)
tree_reg_tmin_mae = evaluate_model(tree_reg_tmin, features_test_scaled, target_tmin_test)
rf_reg_tmax_mae = evaluate_model(rf_reg_tmax, features_test_scaled, target_tmax_test)
rf_reg_tmin_mae = evaluate_model(rf_reg_tmin, features_test_scaled, target_tmin_test)
gb_reg_tmax_mae = evaluate_model(gb_reg_tmax, features_test_scaled, target_tmax_test)
gb_reg_tmin_mae = evaluate_model(gb_reg_tmin, features_test_scaled, target_tmin_test)
print("Linear Regression - Next_Tmax MAE:", linear_reg_tmax_mae)
print("Linear Regression - Next_Tmin MAE:", linear_reg_tmin_mae)
print("Decision Tree - Next_Tmax MAE:", tree_reg_tmax_mae)
print("Decision Tree - Next_Tmin MAE:", tree_reg_tmin_mae)
print("Random Forest - Next_Tmax MAE:", rf_reg_tmax_mae)
print("Random Forest - Next_Tmin MAE:", rf_reg_tmin_mae)
print("Gradient Boosting - Next_Tmax MAE:", gb_reg_tmax_mae)
print("Gradient Boosting - Next_Tmin MAE:", gb_reg_tmin_mae)
tree_tmax_feature_importance = tree_reg_tmax.feature_importances_
tree_tmin_feature_importance = tree_reg_tmin.feature_importances_
rf_tmax_feature_importance = rf_reg_tmax.feature_importances_
rf_tmin_feature_importance = rf_reg_tmin.feature_importances_
gb_tmax_feature_importance = gb_reg_tmax.feature_importances_
gb_tmin_feature_importance = gb_reg_tmin.feature_importances_
tree_tmax_feature_importance_df = pd.DataFrame({'Feature': features_train.columns, 'Importance': tree_tmax_feature_importance})
tree_tmin_feature_importance_df = pd.DataFrame({'Feature': features_train.columns, 'Importance': tree_tmin_feature_importance})
rf_tmax_feature_importance_df = pd.DataFrame({'Feature': features_train.columns, 'Importance': rf_tmax_feature_importance})
rf_tmin_feature_importance_df = pd.DataFrame({'Feature': features_train.columns, 'Importance': rf_tmin_feature_importance})
gb_tmax_feature_importance_df = pd.DataFrame({'Feature': features_train.columns, 'Importance': gb_tmax_feature_importance})
gb_tmin_feature_importance_df = pd.DataFrame({'Feature': features_train.columns, 'Importance': gb_tmin_feature_importance})
print("Decision Tree - Next_Tmax Feature Importance:")
print(tree_tmax_feature_importance_df.sort_values(by='Importance', ascending=False))
print("\nDecision Tree - Next_Tmin Feature Importance:")
print(tree_tmin_feature_importance_df.sort_values(by='Importance', ascending=False))
print("\nRandom Forest - Next_Tmax Feature Importance:")
print(rf_tmax_feature_importance_df.sort_values(by='Importance', ascending=False))
print("\nRandom Forest - Next_Tmin Feature Importance:")
print(rf_tmin_feature_importance_df.sort_values(by='Importance', ascending=False))
print("\nGradient Boosting - Next_Tmax Feature Importance:")
print(gb_tmax_feature_importance_df.sort_values(by='Importance', ascending=False))
print("\nGradient Boosting - Next_Tmin Feature Importance:")
print(gb_tmin_feature_importance_df.sort_values(by='Importance', ascending=False))
